This article establishes the performance of stochastic blockmodels
in addressing the co-clustering problem of partitioning a binary array 
into subsets, assuming only that the data are generated by a nonpara- 
metric process satisfying the condition of separate exchangeability. 
We provide oracle inequalities with rate of convergence $O_P(n^{-1/4})$
corresponding to profile likelihood maximization and mean-square er- 
ror minimization, and show that the blockmodel can be interpreted 
in this setting as an optimal piecewise-constant approximation to the 
generative nonparametric model. We also show for large sample sizes 
that detection of co-clusters in such data indicates with high probabil- 
ity the existence of co-clusters of similar proportion and connectivity 
in the generative process. 



1. Introduction. Blockmodels are popular tools for network modeling 
that see wide and rapidly growing use in analyzing social, economic, and 
biological systems; see [13, 26] for recent overviews. This article establishes 
the performance of stochastic blockmodels for the co-clustering problem [15, 
24] of partitioning a binary array into subsets, assuming only that the data 
are generated by a general nonparametric process satisfying the condition of 
separate exchangeability [12]. This significantly generalizes known results for 
the blockmodel and its co-blockmodel variant, which have only recently been 
established under the requirement that the model be correctly specified [3, 
4, 9, 10, 14, 15, 23, 24, 27]. 

The stochastic blockmodel provides a natural parametric approximation 
in the nonparametric setting we consider [4]. We quantify this notion by
deriving an oracle inequality which states that the maximum profile likelihood
estimate asymptotically minimizes a Kullback-Leibler divergence risk func- 
tional linking the class of co-blockmodels and the nonparametric generative 
process, and also give rates at which it is possible to asymptotically mini- 
mize risk. Additionally, we show that detection of co-clusters by either 
of these methods implies with high probability the existence of co-clusters 
of similar proportion and connectivity in the generative process. 
We also show that the co-blockmodel can identify extremal clusterings
in data — network communities — even if the actual generative process is far 
from a blockmodel. Our results thus motivate a variety of practical algo- 
rithms for statistical network analysis in the area of community detection. A 
great deal of attention and effort has been devoted to this task [13, 16, 22, 
26], but the qualitative interpretability of results remains greatly hampered 
by the fact that the validity of any chosen model in correctly specifying 
cluster-like behavior is often highly debatable. 

Our results imply that community detection can instead be understood 
quantitatively as finding a best piecewise-constant or simple function ap- 
proximation to a flexible nonparametric process. In settings where the un- 
derlying generative process is not well understood and the specification of 
highly structured models is thus premature, such an approach is natural for 
exploratory data analysis. Such usage of blockmodel estimates has even been 
likened to — and may even come to be as canonically accepted as — the use 
of histograms to characterize exchangeable data in non-network settings [3].

The article is organized as follows. In Section 2, we introduce our nonpara- 
metric setting and the stochastic co-blockmodel. In Section 3 we present ora- 
cle inequalities for co-clustering based on blockmodel fitting. In Section 4 we 
derive a general consistency result, showing that the collection of extremal 
co-clusterings of the data converges to that of a generative nonparametric 
process. We prove this result in Section 5, by combining a construction used 
to establish a theory of graph limits [5, 6, 7] with statistical learning theory 
results on U-statistics [11]. In Section 6 we conclude with a brief discussion 
of our results and their relation to other recent papers, such as [9] and [15]. 
Appendices A and B contain additional proofs and technical lemmas. 

2. Model elicitation. Denote by $G = (V_1, V_2, E)$ an observed bipartite
graph with edge set $E$ and vertex sets $(V_1, V_2)$, where we assume known
assignment of vertices to $V_1$ or $V_2$. For example, $V_1$ and $V_2$ might
respectively represent people and locations, with edge $(i,j)$ denoting that
person $i$ frequents location $j$. Let $A$ be the adjacency matrix of $G$.

2.1. Exchangeable graph models. Exchangeability implies that the node 
ordering of a graph carries no information [3, 18]. For a bipartite graph 
represented as a binary array A, the appropriate notion is as follows. 

Definition 2.1 (Separate exchangeability [12]). Let $\{A_{ij}\}_{i,j=1}^{\infty}$
be binary random variables. They are said to be separately exchangeable if

\[ \mathbb{P}(A_{ij} = X_{ij},\ 1 \le i,j \le n) = \mathbb{P}(A_{ij} = X_{\Pi_1(i)\Pi_2(j)},\ 1 \le i,j \le n) \]

for all $n = 1, 2, \ldots$, all permutations $\Pi_1, \Pi_2$ of $1, \ldots, n$,
and all $X \in \{0,1\}^{n \times n}$.

If we identify a finite set of rows and columns of A with the adjacency 
matrix of a bipartite graph, then it is clear that the notion of separate ex- 
changeability encompasses a broad class of network models. Indeed, given a 
single sample of an unlabeled graph, it is natural to consider models that are 
invariant to permutation of its adjacency matrix; see [3, 18] for discussion. 

The assumption of separate exchangeability is the only one we will require 
for our results to hold. A representation of models in this class is given by 
the Aldous-Hoover theorem for separately exchangeable binary arrays. 

Definition 2.2 (Exchangeable array model). Fix a measurable mapping
$\omega : [0,1]^3 \to [0,1]$. Then the following model generates an exchangeable
random bipartite graph $G = (V_1, V_2, E)$ through its adjacency matrix $A$.

1. Generate $\alpha \sim \mathrm{Uniform}(0,1)$;

2. Fix $m = |V_1|$ and $n = |V_2|$, and generate each element of
$\xi = (\xi_1, \ldots, \xi_m)$ and $\zeta = (\zeta_1, \ldots, \zeta_n)$
$\overset{\mathrm{iid}}{\sim} \mathrm{Uniform}(0,1)$;

3. For $i = 1, \ldots, m$ and $j = 1, \ldots, n$, generate
$A_{ij} \sim \mathrm{Bernoulli}(\omega^{\alpha}(\xi_i, \zeta_j))$, where
$\omega(x,y) \equiv \omega^{\alpha}(x,y)$ denotes the function
$(x,y) \mapsto \omega(\alpha, x, y)$. If $A_{ij} = 1$, then connect vertices
$i \in V_1$ and $j \in V_2$.
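
The three generative steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_exchangeable_array`, the `seed` argument, and the particular mapping `example_omega` are assumptions made for the example and do not come from the article.

```python
import random

def sample_exchangeable_array(omega, m, n, seed=None):
    """Draw an m-by-n binary array following the three steps of Definition 2.2."""
    rng = random.Random(seed)
    alpha = rng.random()                      # step 1: network-wide parameter
    xi = [rng.random() for _ in range(m)]     # step 2: latent row parameters
    zeta = [rng.random() for _ in range(n)]   # step 2: latent column parameters
    # step 3: Bernoulli draw for every (row, column) pair
    A = [[1 if rng.random() < omega(alpha, x, y) else 0 for y in zeta]
         for x in xi]
    return A, xi, zeta

def example_omega(alpha, x, y):
    # Hypothetical smooth mapping, constant in alpha; chosen only for illustration.
    return 0.25 + 0.5 * x * y

A, xi, zeta = sample_exchangeable_array(example_omega, m=6, n=8, seed=0)
```

Because `xi` and `zeta` are returned only for inspection, an analysis that mimics the setting of this article would discard them and work with `A` alone.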

The Aldous-Hoover theorem states that this representation is sufficient 
to describe any separately exchangeable network distribution. 

Theorem 2.1 (Aldous-Hoover [12]). Let $\{A_{ij}\}_{i,j=1}^{\infty}$ be a
separately exchangeable binary array. Then there exists some
$\omega : [0,1]^3 \to [0,1]$, unique up to measure-preserving transformation,
which generates $\{A_{ij}\}_{i,j=1}^{\infty}$.

The interpretation of the exchangeable graph model of Definition 2.2 is
that each vertex has a latent parameter in $[0,1]$ ($\xi_i$ for vertex $i$ in
$V_1$, and $\zeta_j$ for vertex $j$ in $V_2$) which determines its affinity for
connecting to other vertices, while $\alpha$ is a network-wide connectivity
parameter (non-identifiable from a single network sample). Because $\xi$ and
$\zeta$ are latent, $\omega(x,y)$ itself is identifiable only up to
measure-preserving transformation, and is hence indistinguishable from any
mapping $(x,y) \mapsto \omega(\alpha, \pi_1(x), \pi_2(y))$ for which
$\pi_1, \pi_2$ are in the set $\mathcal{P}$ of measure-preserving maps on
$[0,1]$. We will identify members of this equivalence class in the sequel.

2.2. The stochastic co-blockmodel. Many popular network models can be
recognized as instances of Definition 2.2. For example, [1, 19, 20] all present
models in which the resulting $\omega(\alpha,x,y)$ is constant in $\alpha$,
while [21] requires the full parameterization $\omega(\alpha,x,y)$. The
stochastic co-blockmodel specifies $\omega(\alpha,x,y)$ constant in $\alpha$
and also piecewise-constant in $x$ and $y$, and thus can be viewed as a simple
function approximation to $\omega(x,y)$ in Definition 2.2.

Definition 2.3 (Stochastic co-blockmodel). Fix integers $K_1, K_2 > 0$, a
matrix $\theta \in [0,1]^{K_1 \times K_2}$, and discrete probability
distributions $\mu$ and $\nu$ over $1, \ldots, K_1$ and $1, \ldots, K_2$. Then
the stochastic co-blockmodel generates an exchangeable bipartite graph
$G = (V_1, V_2, E)$ through the matrix $A$ as follows.

1. Fix $m = |V_1|$ and $n = |V_2|$, and generate
$S = (S(1), \ldots, S(m)) \overset{\mathrm{iid}}{\sim} \mu$ and
$T = (T(1), \ldots, T(n)) \overset{\mathrm{iid}}{\sim} \nu$;

2. For $i = 1, \ldots, m$ and $j = 1, \ldots, n$, generate
$A_{ij} \overset{\mathrm{ind}}{\sim} \mathrm{Bernoulli}(\theta_{S(i)T(j)})$.
If $A_{ij} = 1$, then connect vertices $i \in V_1$ and $j \in V_2$.

Additionally, given co-blockmodel parameters $\phi = (\mu, \nu, \theta)$, define

\[ \omega_\phi(x,y) = \theta_{F_\mu^{-1}(x)\, F_\nu^{-1}(y)} \]

as the mapping corresponding to Definition 2.2, with
$F_\mu^{-1}(x) = \inf_z \{F_\mu(z) \ge x\}$ the inverse distribution function
corresponding to a given distribution $\mu$.
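
As a rough illustration of Definition 2.3 and of the piecewise-constant mapping induced by the co-blockmodel parameters, the following Python sketch samples a co-blockmodel and evaluates that mapping through the inverse distribution function. The function names are hypothetical, and class labels are 0-based here (the article uses 1-based labels).

```python
import random
from itertools import accumulate

def sample_coblockmodel(mu, nu, theta, m, n, seed=None):
    """Sample latent classes S, T and a binary array A as in Definition 2.3."""
    rng = random.Random(seed)
    S = rng.choices(range(len(mu)), weights=mu, k=m)   # row classes, iid from mu
    T = rng.choices(range(len(nu)), weights=nu, k=n)   # column classes, iid from nu
    A = [[1 if rng.random() < theta[S[i]][T[j]] else 0 for j in range(n)]
         for i in range(m)]
    return A, S, T

def inverse_cdf(p, x):
    """Smallest (0-based) class z whose cumulative probability is at least x."""
    for z, cumulative in enumerate(accumulate(p)):
        if cumulative >= x:
            return z
    return len(p) - 1   # guard against round-off when x is very close to 1

def omega_phi(mu, nu, theta, x, y):
    """Piecewise-constant mapping on [0,1]^2 induced by (mu, nu, theta)."""
    return theta[inverse_cdf(mu, x)][inverse_cdf(nu, y)]
```

With `mu = [0.5, 0.5]` and `nu = [0.25, 0.75]`, for instance, `omega_phi` is constant on the four rectangles cut out by the lines x = 0.5 and y = 0.25.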

Without loss of generality we assume $K_1 = K_2 = K$ in the sequel, noting
that our results do not depend in any crucial way on this assumption. Thus,
vertices in $V_1$ belong to one of $K$ latent classes, as do those in $V_2$.
The matrix $\theta \in [0,1]^{K \times K}$ indexes the corresponding connection
affinities between classes in $V_1$ and $V_2$. Because $S$ and $T$ are assumed
latent, the stochastic co-blockmodel is identifiable only up to a permutation
of class labels.

3. Oracle inequalities for co-clustering. If we assume that the separately
exchangeable data model of Definition 2.2 is in force, then it is natural
to approximate $\omega(x,y)$ by way of $\omega_\phi(x,y)$ according to the
stochastic co-blockmodel of Definition 2.3. This approximation task is
equivalent to fixing $K$ and estimating $\phi = (\mu, \nu, \theta)$ by
co-clustering the entries of an observed adjacency matrix
$A \in \{0,1\}^{m \times n}$ corresponding to a bipartite network.

To accomplish this task, we consider M-estimators that involve an
optimization over latent blockmodel variables $S$ and $T$. To describe these
estimators, we must consider the set of all possible co-clusterings of $A$. To
this end, let the set $\Phi$ contain all $(\mu, \nu, \theta)$ of the form
$(\mu, \nu, \theta) \in \mathcal{P}_m \times \mathcal{P}_n \times [0,1]^{K \times K}$,
where $\mathcal{P}_m$ denotes the set of all probability distributions over
$\{1, \ldots, K\}$ whose elements are integer multiples of $1/m$:

\[ \mathcal{P}_m = \Big\{ p \in \{0, \tfrac{1}{m}, \ldots, 1\}^K : \sum_{a=1}^K p_a = 1 \Big\}, \]

and let $\mathcal{Q}^m_\mu$ denote the set of all assignment functions that
partition the set $\{1, \ldots, m\}$ into $K$ classes in a manner that respects
the proportions dictated by $\mu \in \mathcal{P}_m$:

\[ \mathcal{Q}^m_\mu = \big\{ v \in \{1, \ldots, K\}^m : |v^{-1}(a)| = m\mu_a,\ a = 1, \ldots, K \big\}. \]

Note that by construction, any estimator
$\hat\phi(A) = (\hat\mu, \hat\nu, \hat\theta)$ based on an empirical
co-clustering of the observed data $A \in \{0,1\}^{m \times n}$ has codomain
$\Phi$.

We now establish that, for mean-square risk and Kullback-Leibler divergence,
there exist M-estimators that enable us to determine, with rate of convergence
$n^{-1/4}$, optimal piecewise-constant approximations of the generative
$\omega(x,y)$, up to quantization due to the discreteness of $\Phi$.

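Theorem 3.1 (Oracle inequalities for co-clustering). Let
$A \in \{0,1\}^{m \times n}$ be a separately exchangeable array generated by
some $\omega$ in accordance with Definition 2.2, and consider fitting a
$K$-class stochastic co-blockmodel parameterized by
$\phi = (\mu, \nu, \theta)$ to $A$. Then as $n \to \infty$, with $K$ and $m/n$
fixed,

1. For the least squares co-blockmodel M-estimator

\[ \hat\phi = \operatorname*{argmin}_{\phi \in \Phi} \Big\{ \min_{S \in \mathcal{Q}^m_\mu,\, T \in \mathcal{Q}^n_\nu} \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \big| A_{ij} - \theta_{S(i)T(j)} \big|^2 \Big\} \tag{3.1} \]

relative to the risk

\[ R_\omega(\phi) = \inf_{\pi_1, \pi_2 \in \mathcal{P}} \iint_{[0,1]^2} \big| \omega(\pi_1(x), \pi_2(y)) - \omega_\phi(x,y) \big|^2 \, dx\, dy, \]

we have that

\[ R_\omega(\hat\phi) - \inf_{\phi \in \Phi} R_\omega(\phi) = O_P(n^{-1/4}). \]

2. For the profile likelihood co-blockmodel M-estimator

\[ \hat\phi = \operatorname*{argmax}_{\phi \in \Phi} \Big\{ \max_{S \in \mathcal{Q}^m_\mu,\, T \in \mathcal{Q}^n_\nu} \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \big\{ A_{ij} \log \theta_{S(i)T(j)} + (1 - A_{ij}) \log(1 - \theta_{S(i)T(j)}) \big\} \Big\} \tag{3.2} \]

relative to

\[ L_\omega(\phi) = \sup_{\pi_1, \pi_2 \in \mathcal{P}} \iint_{[0,1]^2} \big\{ \omega(\pi_1(x), \pi_2(y)) \log \omega_\phi(x,y) + [1 - \omega(\pi_1(x), \pi_2(y))] \log(1 - \omega_\phi(x,y)) \big\} \, dx\, dy, \]

we have whenever $\operatorname{argmax}_{\phi \in \Phi} L_\omega(\phi)$ exists
that

\[ L_\omega\big(\operatorname*{argmax}_{\phi \in \Phi} L_\omega(\phi)\big) - L_\omega(\hat\phi) = O_P\Big( \big[ B\big(\operatorname*{argmax}_{\phi \in \Phi} L_\omega(\phi)\big) + B(\hat\phi) \big]\, n^{-1/4} \Big), \tag{3.3} \]

with $B(\phi) = B(\theta(\phi)) = \max_{1 \le a,b \le K} |\log(\theta_{ab}/(1 - \theta_{ab}))|$.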

Theorem 3.1 is proved in Appendix A. Its first result establishes that
minimization of the squared error between a fitted co-blockmodel and the
data according to (3.1) serves as a proxy for approximation of $\omega$ by
$\omega_\phi$ in mean square, while its second result establishes that fitting
a stochastic co-blockmodel via profile likelihood according to (3.2) is
equivalent to minimizing the average Kullback-Leibler divergence of the
approximation $\omega_\phi(x,y)$ from the generative $\omega(x,y)$.
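
To make the two criteria concrete, the sketch below evaluates the inner objectives of (3.1) and (3.2) for one fixed pair of assignments. It relies on the standard fact that, for fixed S and T, the block averages of A both minimize the squared error and maximize the Bernoulli log-likelihood over the connectivity matrix. The function names and the 0-based class labels are assumptions of this example, and the outer search over assignments (the hard part) is not attempted.

```python
import math

def block_means(A, S, T, K):
    """Average of A over each (row class, column class) block; empty blocks get 0."""
    sums = [[0.0] * K for _ in range(K)]
    counts = [[0] * K for _ in range(K)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            sums[S[i]][T[j]] += a
            counts[S[i]][T[j]] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(K)] for a in range(K)]

def squared_error(A, S, T, theta):
    """Normalized squared error between A and the fitted block values."""
    m, n = len(A), len(A[0])
    return sum((A[i][j] - theta[S[i]][T[j]]) ** 2
               for i in range(m) for j in range(n)) / (m * n)

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    if x == 0:
        return 0.0
    return x * math.log(y) if y > 0 else -math.inf

def log_likelihood(A, S, T, theta):
    """Normalized Bernoulli log-likelihood of A given assignments and theta."""
    m, n = len(A), len(A[0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            t = theta[S[i]][T[j]]
            total += xlogy(A[i][j], t) + xlogy(1 - A[i][j], 1 - t)
    return total / (m * n)
```

A fitted value of exactly 0 or 1 in a block that contains a contradicting entry drives `log_likelihood` to negative infinity, which is the phenomenon Remark 3.2 addresses through the terms B.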

While the necessary optimizations in (3.1) and (3.2) are not currently 
known to admit efficient exact algorithms, they strongly resemble existing 
objective functions for community detection for which many authors have 
reported good heuristics [16, 22, 26]. Furthermore, polynomial-time spectral 
algorithms are known in certain settings to find correct labelings under the 
assumption of a generative blockmodel [14, 23], suggesting that efficient 
algorithms may exist when distinct clusterings or community divisions are 
present in the data. In this vein, a universal thresholding procedure based 
on the singular value decomposition has been very recently proposed in [9]. 

Remark 3.1. The objective function of (3.2) can be replaced by the full
profile likelihood

\[ \max_{S,T} \Big\{ \sum_{i=1}^m \log \mu_{S(i)} + \sum_{j=1}^n \log \nu_{T(j)} + \sum_{i=1}^m \sum_{j=1}^n \big\{ A_{ij} \log \theta_{S(i)T(j)} + (1 - A_{ij}) \log(1 - \theta_{S(i)T(j)}) \big\} \Big\}, \]

and the same rate of convergence can then be established with respect to the
corresponding term for $L_\omega(\phi)$, adapting the proofs in Appendices A
and B.

Remark 3.2. Let $\bar\phi = \operatorname{argmax}_{\phi \in \Phi} L_\omega(\phi)$.
Terms $B(\bar\phi)$ and $B(\hat\phi)$ in (3.3) indicate that elements of
$\bar\theta$ and $\hat\theta$ must be bounded away from zero and one as
$n \to \infty$; otherwise $L_\omega(\hat\phi)$ can be much smaller than
$L_\omega(\bar\phi)$.

This is a natural consequence of the fact that the Kullback-Leibler
divergence of $\omega_{\hat\phi}$ from $\omega$ is finite if and only if
$\omega$ is absolutely continuous with respect to $\omega_{\hat\phi}$. To see
that this must be the case, consider $\xi$, $\zeta$, and $A$ generated
according to the model of Definition 2.2 with $\omega$ constant in $\alpha$ as

\[ \omega(x,y) = \begin{cases} 1 & \text{if } x < 1/2,\ y < 1/2, \\ 0 & \text{otherwise.} \end{cases} \]

Let $\hat\mu_1 = m^{-1} \sum_{i=1}^m 1\{\xi_i < 1/2\}$, and let
$\hat\nu_1 = n^{-1} \sum_{j=1}^n 1\{\zeta_j < 1/2\}$. It can be seen that
$\omega_{\hat\phi}$ corresponding to the maximum-likelihood blockmodel fit is

\[ \omega_{\hat\phi}(x,y) = \begin{cases} 1 & \text{if } x < \hat\mu_1,\ y < \hat\nu_1, \\ 0 & \text{otherwise.} \end{cases} \]

Thus $L_\omega(\hat\phi) = -\infty$ unless $\hat\mu_1 = \hat\nu_1 = 1/2$.

4. Convergence of extremal co-clusterings. Theorem 3.1 provides 
oracle inequalities which state that a co-blockmodel fitted to separately ex- 
changeable network data will be near-optimal in terms of minimizing risk. 
We now show that the fitted model is interpretable, in the sense that co- 
clusters of similar proportion and connectivity to those fitted will exist with 
high probability in the generative process of Definition 2.2. 

Our next theorem can also be seen as a general consistency result in 
its own right, and is the main technical tool necessary to obtain the rate 
of convergence $O_P(n^{-1/4})$ in Theorem 3.1. It states that the collection of
extremal co-clusterings of the data converges to the collection of extremal 
co-clusterings of the generative nonparametric process of Definition 2.2. 

As in Section 3, we require a means of indexing the set of all possible
co-clusterings of any bipartite adjacency matrix $A$ and nonparametric
generating function $\omega(x,y)$. To this end, for distributions $\mu, \nu$,
adjacency matrix $A \in \{0,1\}^{m \times n}$, and $\omega : [0,1]^2 \to [0,1]$
as in Sections 2 and 3, we will define sets of matrices
$\mathcal{F}^A_{\mu\nu}$ and $\mathcal{F}^\omega_{\mu\nu}$ to represent the set
of possible co-clusters that can be induced, respectively from the data and
from the generative process, by all partitions with class proportions
specified by $\mu$ and $\nu$.

Our main result is that the convex hulls of $\mathcal{F}^A_{\mu\nu}$ and
$\mathcal{F}^\omega_{\mu\nu}$ converge, implying convergence of minimum-risk
estimates for any risk functional that achieves its minimum for fixed $\mu$ and
$\nu$ at an extremal point of the convex hull of the feasible set. To make this
notion precise, we first require the following.

4.1. Representing sets of co-clusterings and their extrema. Given a
bipartite graph $G = (V_1, V_2, E)$ with adjacency matrix
$A \in \{0,1\}^{m \times n}$, let $S \in \{1, \ldots, K\}^m$ and
$T \in \{1, \ldots, K\}^n$ partition $V_1$ and $V_2$ respectively into $K$
subsets. Let $A/ST \in [0,1]^{K \times K}$ count the number of edges spanning
each subset pair, normalized by total number $mn$ of possible edges:

\[ (A/ST)(a,b) = \frac{1}{mn} \sum_{i \in S^{-1}(a)} \sum_{j \in T^{-1}(b)} A_{ij}, \qquad a, b = 1, \ldots, K. \]
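
The normalized edge-count matrix just defined is straightforward to compute. The sketch below is one possible implementation; the name `quotient` and the 0-based class labels are choices made for this example rather than notation from the article.

```python
def quotient(A, S, T, K):
    """Return the K-by-K matrix A/ST of normalized block edge counts."""
    m, n = len(A), len(A[0])
    Q = [[0.0] * K for _ in range(K)]
    for i in range(m):
        for j in range(n):
            # entry (i, j) contributes to the block of its row and column classes
            Q[S[i]][T[j]] += A[i][j] / (m * n)
    return Q
```

The entries of the returned matrix always sum to the overall edge density of `A`, whatever the assignments are.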

Similarly, given $\omega : [0,1]^2 \to [0,1]$ and mappings
$\sigma, \tau : [0,1] \to \{1, \ldots, K\}$, let
$\omega/\sigma\tau \in [0,1]^{K \times K}$ encode the mass of $\omega$ assigned
to each subset pair:

\[ (\omega/\sigma\tau)(a,b) = \iint_{\sigma^{-1}(a) \times \tau^{-1}(b)} \omega(x,y)\, dx\, dy, \qquad a, b = 1, \ldots, K. \]

Let $\mathcal{Q}^m_\mu$ be defined as in Section 3, and let $\mathcal{Q}_\mu$
analogously denote the set of partitions of $[0,1]$ into $K$ subsets whose
cardinalities are of proportions $\mu_1, \ldots, \mu_K$:

\[ \mathcal{Q}_\mu = \big\{ \sigma : [0,1] \to \{1, \ldots, K\} \text{ such that } |\sigma^{-1}(a)| = \mu_a,\ a = 1, \ldots, K \big\}. \]

We are now ready to state our required definitions.

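Definition 4.1 (Sets $\mathcal{F}^A_{\mu\nu}$ and $\mathcal{F}^\omega_{\mu\nu}$
of possible co-clusters). For fixed discrete probability distributions $\mu$
and $\nu$ over $1, \ldots, K$, we define the sets $\mathcal{F}^A_{\mu\nu}$ and
$\mathcal{F}^\omega_{\mu\nu}$ of all co-clustering matrices $A/ST$ and mappings
$\omega/\sigma\tau$ induced by
$(S,T) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu$ and
$(\sigma,\tau) \in \mathcal{Q}_\mu \times \mathcal{Q}_\nu$ as follows:

\[ \mathcal{F}^A_{\mu\nu} = \{ A/ST : S \in \mathcal{Q}^m_\mu,\ T \in \mathcal{Q}^n_\nu \}, \]
\[ \mathcal{F}^\omega_{\mu\nu} = \{ \omega/\sigma\tau : \sigma \in \mathcal{Q}_\mu,\ \tau \in \mathcal{Q}_\nu \}. \]

Definition 4.2 (Support function). Let $\mathcal{F}$ be a non-empty subset of
$\mathbb{R}^{K \times K}$, endowed with the standard inner product
$\langle \Gamma, F \rangle = \operatorname{tr}(\Gamma^T F)$. We define the
support function
$h_{\mathcal{F}} : \mathbb{R}^{K \times K} \to \mathbb{R} \cup \{+\infty\}$ of
$\mathcal{F} \subset \mathbb{R}^{K \times K}$ as

\[ h_{\mathcal{F}}(\Gamma) = \sup_{F \in \mathcal{F}} \langle \Gamma, F \rangle. \]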

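4.2. A general result on consistency of co-clustering. The sets
$\mathcal{F}^A_{\mu\nu}, \mathcal{F}^\omega_{\mu\nu} \subset [0,1]^{K \times K}$
describe all possible co-clusterings that can be induced respectively from $A$
and $\omega$ with respect to $\mu$ and $\nu$. The support function
$h_{\mathcal{F}}(\Gamma)$ defines the supporting hyperplanes of such an
$\mathcal{F}$ in any direction $\Gamma$, thereby providing a representation of
its closed convex hull. From the above we see

\[ h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) = \max_{(S,T) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \langle \Gamma, A/ST \rangle, \tag{4.1a} \]

\[ h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) = \sup_{(\sigma,\tau) \in \mathcal{Q}_\mu \times \mathcal{Q}_\nu} \langle \Gamma, \omega/\sigma\tau \rangle. \tag{4.1b} \]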
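
For very small arrays, the data-side support function in (4.1a) can be evaluated by exhaustive search, which may help build intuition for what the theorem below compares. The following sketch does this by enumerating every assignment with prescribed class sizes. It is exponential in m and n and intended only as an illustration; the function names, the 0-based labels, and the use of class counts (m times each proportion) as inputs are assumptions of the example.

```python
from itertools import product

def assignments(size, counts):
    """All label vectors in {0..K-1}^size whose class sizes equal `counts`."""
    K = len(counts)
    return [v for v in product(range(K), repeat=size)
            if all(v.count(a) == counts[a] for a in range(K))]

def inner(Gamma, A, S, T):
    """Inner product of Gamma with A/ST, written as a single double sum."""
    m, n = len(A), len(A[0])
    return sum(A[i][j] * Gamma[S[i]][T[j]]
               for i in range(m) for j in range(n)) / (m * n)

def support_function(Gamma, A, row_counts, col_counts):
    """Brute-force maximum of inner(Gamma, A, S, T) over admissible (S, T)."""
    m, n = len(A), len(A[0])
    return max(inner(Gamma, A, S, T)
               for S in assignments(m, row_counts)
               for T in assignments(n, col_counts))
```

Choosing `Gamma` with positive diagonal and negative off-diagonal entries rewards assignments that concentrate edges within matching classes, so the maximizing pair plays the role of an extremal co-clustering in that direction.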

Equipped with these notions, we are now ready to state our main result. 
Theorem 4.1. Let $A \in \{0,1\}^{m \times n}$ be a separately exchangeable
array generated by some $\omega$ in accordance with Definition 2.2, and
consider fitting a $K$-class stochastic co-blockmodel to $A$. Then for each $K$
and each ratio $m/n$, there exists a universal constant $C$ such that as
$n \to \infty$,

\[ \mathbb{P}\Big( \max_{\mu \in \mathcal{P}_m,\, \nu \in \mathcal{P}_n} \Big\{ \sup_{\Gamma \in [-1,1]^{K \times K}} \big| h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) \big| \Big\} \ge \frac{C}{n^{1/4}} \Big) = o(1). \]

The geometric implication of Theorem 4.1 is as follows.

Corollary 4.1. Under the assumptions of Theorem 4.1, the convex hulls of
$\mathcal{F}^A_{\mu\nu}$ and $\mathcal{F}^\omega_{\mu\nu}$ converge in the
Hausdorff metric at rate $O_P(n^{-1/4})$.

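Proof of Corollary 4.1. Denote the Frobenius or Hilbert-Schmidt metric on
$\mathbb{R}^{K \times K}$ induced by $\langle \cdot, \cdot \rangle$ as
$d(F, F') = \big( \sum_{a,b} |F_{ab} - F'_{ab}|^2 \big)^{1/2}$. The Hausdorff
distance $d_{\mathrm{Haus}}(\cdot, \cdot)$ between sets
$\mathcal{F}, \mathcal{F}' \subset \mathbb{R}^{K \times K}$ is then given by

\[ d_{\mathrm{Haus}}(\mathcal{F}, \mathcal{F}') = \max\Big\{ \sup_{F \in \mathcal{F}} \inf_{F' \in \mathcal{F}'} d(F, F'),\ \sup_{F' \in \mathcal{F}'} \inf_{F \in \mathcal{F}} d(F, F') \Big\}. \]

Recall that it measures the maximal shortest distance $d$ between any two
elements of $\mathcal{F}$ and $\mathcal{F}'$. Given non-empty, totally bounded
$\mathcal{F}, \mathcal{F}' \subset \mathbb{R}^{K \times K}$, it holds, with
conv denoting the convex hull and $\|\cdot\|$ the Frobenius norm, that

\[ d_{\mathrm{Haus}}(\operatorname{conv}(\mathcal{F}), \operatorname{conv}(\mathcal{F}')) = \sup_{\|\Gamma\| = 1} \big| h_{\mathcal{F}}(\Gamma) - h_{\mathcal{F}'}(\Gamma) \big| \]

(see, e.g., [25], as applied to the convex hulls of the closures of
$\mathcal{F}$ and of $\mathcal{F}'$). Thus by Theorem 4.1,

\[ \max_{\mu \in \mathcal{P}_m,\, \nu \in \mathcal{P}_n} d_{\mathrm{Haus}}\big( \operatorname{conv}(\mathcal{F}^A_{\mu\nu}), \operatorname{conv}(\mathcal{F}^\omega_{\mu\nu}) \big) = O_P(n^{-1/4}). \]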

□ 
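
As a small aid to intuition, the Hausdorff distance used in this proof can be computed directly when both sets are finite collections of matrices. The sketch below does exactly that under the Frobenius metric. It does not take convex hulls, so it illustrates the definition of the distance rather than the support-function identity, and the function names are hypothetical.

```python
import math

def frobenius(F, G):
    """Frobenius distance between two matrices given as lists of rows."""
    return math.sqrt(sum((f - g) ** 2
                         for row_f, row_g in zip(F, G)
                         for f, g in zip(row_f, row_g)))

def hausdorff(Fs, Gs):
    """Hausdorff distance between two finite, non-empty sets of matrices."""
    forward = max(min(frobenius(F, G) for G in Gs) for F in Fs)
    backward = max(min(frobenius(F, G) for F in Fs) for G in Gs)
    return max(forward, backward)
```

The two directed terms can differ substantially, for example when one set is contained in the other, which is why the definition takes their maximum.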

The result of Theorem 4.1 can be directly related to work in [6, 7], a pair
of articles which explores in depth the notion of a graph limit as $n$ goes
to infinity, and the statistical applications thereof as discussed in [5]. Very
broadly speaking, [6, Theorem 2.9] and [7, Theorem 4.6] analyze sets which
resemble $\bigcup_{\mu,\nu} \mathcal{F}^A_{\mu\nu}$ and
$\bigcup_{\mu,\nu} \mathcal{F}^\omega_{\mu\nu}$, and are termed quotients. For
these sets, the authors show convergence in the Hausdorff metric at rate
$O((\log n)^{-1/2})$ for a distance $d_\square$ known as the cut metric, and
detail many implications thereof. By fixing $\mu$ and $\nu$, as required by our
M-estimates, we are studying what those authors term the microcanonical
quotients. By restricting convergence to the convex hull in Theorem 4.1, as
allowed by our M-estimates, we gain an exponentially faster bound on the rate
of convergence.
5. Proof of Theorem 4.1. Our proof strategy is inspired by [6] and
adapts certain of its tools, but requires the sets $\mathcal{F}^A_{\mu\nu}$ and
$\mathcal{F}^\omega_{\mu\nu}$ to be studied separately for each choice of $\mu$
and $\nu$ in order to yield Corollary 4.1 and the oracle inequalities of
Theorem 3.1. Most significantly, we do not use the Szemerédi regularity
lemma, which typically features strongly in the graph-theoretic literature,
and provides a means of partitioning any large dense graph into a small
number of regular clusters. Results in this direction are possible, but instead
we use a Rademacher complexity bound for U-statistics adapted from [11],
allowing us to achieve the improved rates of convergence described above.

5.1. Establishing pointwise convergence. The main step in proving
Theorem 4.1 is to establish pointwise convergence of
$h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$ to $h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$
for any fixed $\Gamma$. We do this through Proposition 5.1 below, after which
we may apply it to a union bound over a covering of all
$\Gamma \in [-1,1]^{K \times K}$ to deduce the result of Theorem 4.1.
Appendix B provides a formal statement and proof of this argument, along with
proofs of all supporting lemmas.

Proposition 5.1 (Pointwise convergence of $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$
to $h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$). Assume the setting of
Theorem 4.1, fixing $m = \rho n$. Then there exist constants $C_K, n_K$ such
that, given any $\Gamma \in [-1,1]^{K \times K}$, $\mu$, $\nu$, $\omega$, and
$A \in \{0,1\}^{m \times n}$ generated from $\omega$, it holds for all
$n \ge n_K$ that

\[ \mathbb{P}\Big( \big| h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) \big| \ge \frac{C_K}{n^{1/4}} \Big) \le 2 e^{-\sqrt{n}\,[2\rho/(\rho+1)]}\, [1 + o(1)]. \]

Proof of Proposition 5.1. To obtain the claimed result, we must establish
lower and upper bounds on the support function
$h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$ that show its convergence to
$h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$ at rate $O_P(n^{-1/4})$. Recalling
the definitions of $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$ and
$h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$ in (4.1), we first require a
statement of Lipschitz conditions on $\langle \Gamma, A/ST \rangle$ and
$\langle \Gamma, \omega/\sigma\tau \rangle$. Its proof follows by direct
inspection.

Lemma 5.1. Define for measurable mappings $\sigma, \sigma'$ over $[0,1]$ the
metric

\[ d_{\mathrm{Ham}}(\sigma, \sigma') = \int_{[0,1]} 1\{\sigma(x) \ne \sigma'(x)\}\, dx, \]

and analogously the standard Hamming distance for sequences, with respect
to normalized counting measure. Then for any $\Gamma \in [-1,1]^{K \times K}$
and $A, A' \in [0,1]^{m \times n}$, with $(S, T, \omega, \sigma, \tau)$ as
defined in Section 4.1, we have that

1. $|\langle \Gamma, A/ST \rangle - \langle \Gamma, A/S'T' \rangle| \le 2\,[d_{\mathrm{Ham}}(S,S')/m + d_{\mathrm{Ham}}(T,T')/n]$;

2. $|\langle \Gamma, \omega/\sigma\tau \rangle - \langle \Gamma, \omega/\sigma'\tau' \rangle| \le 2\,[d_{\mathrm{Ham}}(\sigma,\sigma') + d_{\mathrm{Ham}}(\tau,\tau')]$;

3. $|\langle \Gamma, A/ST \rangle - \langle \Gamma, A'/ST \rangle| \le 1/(mn)$ if $A, A'$ differ by a single entry.

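In conjunction with McDiarmid's inequality, these Lipschitz conditions
yield the following lower bound on $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$, proved
in Appendix B.1.

Lemma 5.2 (Lower bound on $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$). Assume the
setting of Theorem 4.1. Then there exist constants $C'_K, n'_K$ such that,
given any $\Gamma \in [-1,1]^{K \times K}$, $\mu$, $\nu$, $\omega$, and
$A \in \{0,1\}^{\rho n \times n}$ generated from $\omega$, for all
$n \ge n'_K$,

\[ \mathbb{P}\Big( h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) \ge \frac{C'_K}{n^{1/4}} \Big) \le 2 e^{-\sqrt{n}\,[2\rho/(\rho+1)]}\, [1 + o(1)]. \]

The upper bound comes by way of Rademacher complexity arguments.
The remainder of this section and Appendix B is devoted to its proof.

Lemma 5.3 (Upper bound on $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$). Assume the
setting of Theorem 4.1. Then there exist constants $C''_K, n''_K$ such that,
given any $\Gamma \in [-1,1]^{K \times K}$, $\mu$, $\nu$, $\omega$, and
$A \in \{0,1\}^{\rho n \times n}$ generated from $\omega$, for all
$n \ge n''_K$,

\[ \mathbb{P}\Big( h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) \ge \frac{C''_K}{n^{1/4}} \Big) \le 2 e^{-\sqrt{n}\,[2\rho/(\rho+1)]}\, [1 + o(1)]. \]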

Proposition 5.1 now follows simply by combining Lemmas 5.2 and 5.3. 

□ 

5.2. Establishing an upper bound on $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$.
Lemma 5.3 represents the main technical hurdle in obtaining the polynomial rate
of convergence given in Theorems 3.1 and 4.1. To illustrate the main ideas as
clearly as possible, we will introduce our Rademacher complexity arguments
below for the case $K = 2$, deferring the necessary generalizations to
Appendix B.

We first define $W \in [0,1]^{m \times n}$ with reference to Definition 2.2 as

\[ W_{ij} = \omega(\xi_i, \zeta_j), \qquad i \in 1, \ldots, m,\ j \in 1, \ldots, n; \]

and then define, in direct analogy to $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$,

\[ h_{\mathcal{F}^W_{\mu\nu}}(\Gamma) = \max_{(S,T) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \langle \Gamma, W/ST \rangle = \max_{(S,T) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \Big\{ \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n W_{ij}\, \Gamma_{S(i)T(j)} \Big\}. \]

The matrix $W$ serves as an empirical realization of the mapping $\omega$, with
its support function $h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)$ defined with respect
to co-blockmodel partitions
$(S,T) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu$. As proved in
Appendix B.2, Lemma 5.4 enables us to bound
$|h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - \mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)|$
using the Lipschitz conditions in Lemma 5.1.
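Lemma 5.4. Fix some measurable $\omega : [0,1]^2 \to [0,1]$, with
$W \in [0,1]^{m \times n}$ generated by $\omega$ and
$A \in \{0,1\}^{m \times n}$ generated by $W$, and some
$\Gamma \in [-1,1]^{K \times K}$. Then for any $\epsilon > 0$,

\[ \mathbb{P}\big( |h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - \mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)| \ge 2\epsilon \big) \le 2 e^{-2mn\epsilon^2/(m+n)} + 2 K^{m+n} e^{-2mn\epsilon^2}. \tag{5.1} \]

Having bounded
$|h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - \mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)|$,
we must upper-bound $\mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)$ in terms
of $h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$. We will do this in a series of
steps, first bounding $\mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma)$ using a
result adapted from [2] and proved in Appendix B.3.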

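Lemma 5.5. Let $\mathcal{I}$ and $\mathcal{J}$ be sets of deterministic size,
whose elements are sampled without replacement from $1, \ldots, m$ and
$1, \ldots, n$. Let $W$ be generated as in Lemma 5.4, and fix
$\Gamma \in [-1,1]^{K \times K}$. Given $W$, $\mathcal{I}$, $\mathcal{J}$, and
$(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu$, let
$\hat S^R = \hat S^{W,\mathcal{J},R}$ and $\hat T^Q = \hat T^{W,\mathcal{I},Q}$
denote partitions satisfying

\[ \hat S^R \in \operatorname*{argmax}_{S \in \mathcal{Q}^m_\mu} \Big\{ \sum_{i=1}^m \sum_{j \in \mathcal{J}} W_{ij}\, \Gamma_{S(i)R(j)} \Big\}, \tag{5.2} \]

\[ \hat T^Q \in \operatorname*{argmax}_{T \in \mathcal{Q}^n_\nu} \Big\{ \sum_{i \in \mathcal{I}} \sum_{j=1}^n W_{ij}\, \Gamma_{Q(i)T(j)} \Big\}. \tag{5.3} \]

Then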
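\[ \mathbb{E}\, h_{\mathcal{F}^W_{\mu\nu}}(\Gamma) \le \mathbb{E}\Big[ \max_{(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \langle \Gamma, W/\hat S^R \hat T^Q \rangle \Big] + K\sqrt{2\pi}\, \big\{ |\mathcal{I}|^{-1/2} + |\mathcal{J}|^{-1/2} \big\}. \tag{5.4} \]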



To bound the right-hand side of (5.4) relative to
$h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$, we will introduce an additional
construction comprising several steps. Specifically, for fixed $(Q,R)$ and
$\Gamma$, we will define function classes $\mathcal{Q}_U$ and $\mathcal{Q}_V$,
and a random functional $G_{\sigma\tau}$ which approximates
$\langle \Gamma, W/\hat S^R \hat T^Q \rangle$ for some
$(\hat\sigma, \hat\tau) \in \mathcal{Q}_U \times \mathcal{Q}_V$. By a
Rademacher complexity argument, $G_{\hat\sigma\hat\tau}$ will concentrate for
all $(Q,R)$ near its expectation, which itself will be bounded by
$h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$.

For the case $K = 2$, define $U$ by

\[ U(x) = \sum_{j \in \mathcal{J}} \omega(x, \zeta_j)\, \big( \Gamma_{1R(j)} - \Gamma_{2R(j)} \big). \]

It follows that

\[ \hat S^R \in \operatorname*{argmax}_{S \in \mathcal{Q}^m_\mu} \sum_{i=1}^m U(\xi_i)\, 1\{S(i) = 1\}, \]

and so $\hat S^R$ will assign to class 1 the $\mu_1 m$ largest elements of
$U(\xi_1), \ldots, U(\xi_m)$. If $U$ is invertible, this set can be written
$\{\xi_i : U(\xi_i) \ge t\}$ for some $t$. To treat non-invertible $U$, define
$\mathcal{Q}_U$ to be the class of functions $\{\ell_u : u \in [0,1]\}$, with
$\ell_u$ a one-sided interval on the range of $U$ with lexicographic
"tie-breaking":

\[ \ell_u(x) = \begin{cases} 2 & \text{if either } U(x) < U(u), \text{ or } U(x) = U(u) \text{ and } x < u; \\ 1 & \text{if either } U(x) > U(u), \text{ or } U(x) = U(u) \text{ and } x \ge u. \end{cases} \]



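Then there exists $\hat\sigma \in \mathcal{Q}_U$ such that $\hat S^R$ can be
chosen to satisfy

\[ \hat S^R(i) = \hat\sigma(\xi_i), \qquad i = 1, \ldots, m. \]

Let $V$ denote a function defined analogously to $U$ as follows:

\[ V(y) = \sum_{i \in \mathcal{I}} \omega(\xi_i, y)\, \big( \Gamma_{Q(i)1} - \Gamma_{Q(i)2} \big), \]

and likewise define $\mathcal{Q}_V$ so that there exists
$\hat\tau \in \mathcal{Q}_V$ such that $\hat T^Q$ can be chosen to satisfy

\[ \hat T^Q(j) = \hat\tau(\zeta_j), \qquad j = 1, \ldots, n. \]

We are now ready to define $G_{\sigma\tau}$. Given any
$\sigma \in \mathcal{Q}_U$ and $\tau \in \mathcal{Q}_V$, let

\[ G_{\sigma\tau}(\xi, \zeta) = \frac{1}{mn} \sum_{i \in \bar{\mathcal{I}}} \sum_{j \in \bar{\mathcal{J}}} \omega(\xi_i, \zeta_j)\, \Gamma_{\sigma(\xi_i)\tau(\zeta_j)}, \]

where $\bar{\mathcal{I}}$ is the complement of $\mathcal{I}$ in
$\{1, \ldots, m\}$, and $\bar{\mathcal{J}}$ the complement of $\mathcal{J}$ in
$\{1, \ldots, n\}$. Comparing $G_{\sigma\tau}$ to Lemma 5.5, we see that
$G_{\hat\sigma\hat\tau}$ well approximates
$\langle \Gamma, W/\hat S^R \hat T^Q \rangle$ whenever $|\mathcal{I}|$ and
$|\mathcal{J}|$ are small; and indeed, we will later set
$|\mathcal{I}| = |\mathcal{J}| = n^{1/2}$ in order to obtain an upper bound for
$h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$.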

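By construction, the random classes $\mathcal{Q}_U$ and $\mathcal{Q}_V$ are
independent of the random variables $\{\xi_i\}_{i \in \bar{\mathcal{I}}}$ and
$\{\zeta_j\}_{j \in \bar{\mathcal{J}}}$ appearing in the summand of
$G_{\sigma\tau}$. As a result, we may bound the deviation $\delta_{UV}$ of
$G_{\sigma\tau}$ from its expectation,

\[ \delta_{UV} = \sup_{(\sigma,\tau) \in \mathcal{Q}_U \times \mathcal{Q}_V} \big| G_{\sigma\tau}(\xi, \zeta) - \mathbb{E}\big( G_{\sigma\tau}(\xi, \zeta) \,\big|\, U, V \big) \big|, \]

using Rademacher complexity results for U-statistics [11, Lemma A.1], [17],
applied to the class of one-sided interval functions.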

Lemma 5.6. Assume the setting of Lemma 5.5, and set
$\ell = \min(m - |\mathcal{I}|,\ n - |\mathcal{J}|)$. Then the deviation
$\delta_{UV}$ of $G_{\sigma\tau}$ from its expectation satisfies

\[ \mathbb{E}\Big( \max_{(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \delta_{UV} \Big) \le 4 \sqrt{ \frac{ (|\mathcal{I}| + |\mathcal{J}|) \log K + 2\binom{K}{2} \log(\ell + 1) + \log 2 }{ 2\ell } }. \]

Lemma 5.6 is proved in Appendix B.5 to hold for arbitrary $K$, under the
appropriate generalization of $\mathcal{Q}_U$, $\mathcal{Q}_V$, and quantities
that depend on them.

Similarly, we may bound $\delta_U$, defined for $K = 2$ as the maximum
discrepancy between the expected and empirical class frequency in
$\mathcal{Q}_U$:

\[ \delta_U = \sup_{\sigma \in \mathcal{Q}_U} \Big\{ \max_{1 \le a \le K} \Big| |\sigma^{-1}(a)| - \frac{1}{m} \sum_{i=1}^m 1\{\sigma(\xi_i) = a\} \Big| \Big\}, \]

with $\delta_V$ defined mutatis mutandis. We then have the following result,
proved for arbitrary $K$ (with appropriate redefinitions of $\delta_U$ and
$\delta_V$) in Appendix B.6.

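Lemma 5.7. Assume the setting of Lemma 5.5. Then

\[ \mathbb{E}\Big( \max_{R \in \mathcal{Q}^n_\nu} \delta_U \Big) \le 4 \sqrt{ \frac{ (|\mathcal{J}| + 1) \log K + \binom{K}{2} \log(m + 1) + \log 2 }{ 2m } }, \]

\[ \mathbb{E}\Big( \max_{Q \in \mathcal{Q}^m_\mu} \delta_V \Big) \le 4 \sqrt{ \frac{ (|\mathcal{I}| + 1) \log K + \binom{K}{2} \log(n + 1) + \log 2 }{ 2n } }. \]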

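We state and prove a final auxiliary lemma prior to the proof of Lemma 5.3.

Lemma 5.8. Assume the setting of Lemma 5.5. Then

\[ \mathbb{E}\Big[ \max_{(Q,R)} \langle \Gamma, W/\hat S^R \hat T^Q \rangle \Big] - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) \le 2\big\{ m^{-1}|\mathcal{I}| + n^{-1}|\mathcal{J}| \big\} + \mathbb{E}\Big( \max_{(Q,R)} \delta_{UV} \Big) + 2K\, \mathbb{E}\Big( \max_{(Q,R)} \delta_U + \delta_V \Big), \]

where each maximum is taken over
$(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu$.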

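Proof. Let $\bar\sigma$ and $\bar\tau$ denote the mappings in
$\mathcal{Q}_\mu$ and $\mathcal{Q}_\nu$ that are respectively closest in the
metric $d_{\mathrm{Ham}}$ to $\hat\sigma$ and $\hat\tau$. Observe that we may
then expand and upper-bound the left-hand side of the lemma statement by

\[ \underbrace{\mathbb{E}\Big( \max_{(Q,R)} \big\{ \langle \Gamma, W/\hat S^R \hat T^Q \rangle - G_{\hat\sigma\hat\tau}(\xi, \zeta) \big\} \Big)}_{\text{(i)}} + \underbrace{\mathbb{E}\Big( \max_{(Q,R)} \big\{ G_{\hat\sigma\hat\tau}(\xi, \zeta) - \langle \Gamma, \omega/\hat\sigma\hat\tau \rangle \big\} \Big)}_{\text{(ii)}} \]
\[ {} + \underbrace{\mathbb{E}\Big( \max_{(Q,R)} \big\{ \langle \Gamma, \omega/\hat\sigma\hat\tau \rangle - \langle \Gamma, \omega/\bar\sigma\bar\tau \rangle \big\} \Big)}_{\text{(iii)}} + \underbrace{\mathbb{E}\Big( \max_{(Q,R)} \langle \Gamma, \omega/\bar\sigma\bar\tau \rangle \Big) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)}_{\text{(iv)}}, \]

after which we may upper-bound terms (i)-(iv) in turn as follows.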
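First, since $|\omega(x,y)\, \Gamma_{\hat\sigma(x)\hat\tau(y)}| \le 1$ for all
$(x,y)$, it follows from their respective definitions that
$\langle \Gamma, W/\hat S^R \hat T^Q \rangle - G_{\hat\sigma\hat\tau}(\xi, \zeta)$
is deterministically bounded above by
$|\mathcal{I}|/m + |\mathcal{J}|/n$. Hence, term (i) is bounded by the same
quantity.

Second, observe that by definition,
$G_{\hat\sigma\hat\tau}(\xi, \zeta) - \mathbb{E}\big( G_{\hat\sigma\hat\tau}(\xi, \zeta) \mid U, V \big) \le \delta_{UV}$.
Since for fixed $\sigma, \tau$ we have
$\mathbb{E}\big( G_{\sigma\tau}(\xi, \zeta) \mid U, V \big) = \big[ |\bar{\mathcal{I}}|\,|\bar{\mathcal{J}}|/(mn) \big]\, \langle \Gamma, \omega/\sigma\tau \rangle$,
with $|\langle \Gamma, \omega/\sigma\tau \rangle| \le 1$, it holds
deterministically that
$\mathbb{E}\big( G_{\sigma\tau}(\xi, \zeta) \mid U, V \big) - \langle \Gamma, \omega/\sigma\tau \rangle \le |\mathcal{I}|/m + |\mathcal{J}|/n$.
Thus term (ii) is bounded above by the quantity
$\mathbb{E}\big( \max_{(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \delta_{UV} \big) + |\mathcal{I}|/m + |\mathcal{J}|/n$.

Third, by the second Lipschitz condition of Lemma 5.1, we have that
$\langle \Gamma, \omega/\hat\sigma\hat\tau \rangle - \langle \Gamma, \omega/\bar\sigma\bar\tau \rangle \le 2\,[d_{\mathrm{Ham}}(\hat\sigma, \bar\sigma) + d_{\mathrm{Ham}}(\hat\tau, \bar\tau)]$.
Observe that

\[ d_{\mathrm{Ham}}(\hat\sigma, \bar\sigma) \le \sum_{a=1}^K \big| |\hat\sigma^{-1}(a)| - \mu_a \big| \le \sum_{a=1}^K \Big| |\hat\sigma^{-1}(a)| - \frac{1}{m} \sum_{i=1}^m 1\{\hat\sigma(\xi_i) = a\} \Big| \le K \delta_U, \]

where the second inequality holds as $\hat S^R \in \mathcal{Q}^m_\mu$. By the
same argument for $d_{\mathrm{Ham}}(\hat\tau, \bar\tau)$, we see term (iii) is
bounded by
$2K\, \mathbb{E}\big( \max_{(Q,R) \in \mathcal{Q}^m_\mu \times \mathcal{Q}^n_\nu} \delta_U + \delta_V \big)$.
To conclude, note term (iv) is deterministically upper-bounded by 0. □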

We may now establish the claimed upper bound on
$h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma)$.

Proof of Lemma 5.3. Combining the results of Lemmas 5.4-5.8 yields
directly that, with probability at least
$1 - 2e^{-2mn\epsilon^2/(m+n)} - 2K^{m+n} e^{-2mn\epsilon^2}$,

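\[ h_{\mathcal{F}^A_{\mu\nu}}(\Gamma) - h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) \le 2\epsilon + K\sqrt{2\pi}\, \big\{ |\mathcal{I}|^{-1/2} + |\mathcal{J}|^{-1/2} \big\} + 2\big\{ m^{-1}|\mathcal{I}| + n^{-1}|\mathcal{J}| \big\} \]
\[ {} + f\big( |\mathcal{I}| + |\mathcal{J}|,\ \ell,\ 2\tbinom{K}{2} \big) + 2K \big\{ f\big( |\mathcal{I}| + 1,\ n,\ \tbinom{K}{2} \big) + f\big( |\mathcal{J}| + 1,\ m,\ \tbinom{K}{2} \big) \big\}, \]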

where $f(p,q,r) = 4\,\{ [\, p \log K + r \log(q+1) + \log 2 \,] / (2q) \}^{1/2}$,
and $\ell = \min(m - |\mathcal{I}|,\ n - |\mathcal{J}|)$ as in Lemma 5.6.
Letting $\epsilon = n^{-1/4}$, $|\mathcal{I}| = |\mathcal{J}| = n^{1/2}$, and
fixing $m = \rho n$ as assumed in the hypothesis of Lemma 5.3, it follows that
for $n \ge 2$,



+ 



4 + 12{K^ log(/3n + 1) + 2)V2 



n 



1/2 



with probability at least
$1 - 2e^{-\sqrt{n}\,[2\rho/(\rho+1)]} - 2K^{(\rho+1)n} e^{-2\rho n^{3/2}}$. Thus
we have that $h_{\mathcal{F}^A_{\mu\nu}}(\Gamma)$ is bounded above by
$h_{\mathcal{F}^\omega_{\mu\nu}}(\Gamma) + O_P(n^{-1/4})$, as claimed. □






CHOI & WOLFE 



6. Discussion. In this article we have addressed the case of network
co-clustering, in which the inference task is to group two sets of network nodes
into classes based on their observed relations. Our results significantly gen- 
eralize known consistency results for the blockmodel and its co-blockmodel 
variant: they do not require the data to be generated (even approximately) 
by a co-blockmodel, and they achieve improved rates of convergence relative 
to results from the graph limits literature, through the use of a Rademacher
complexity bound for U-statistics adapted from [11]. The assumption of a 
nonparametric generative model is both more general and more realistic, 
and to our knowledge Theorems 3.1 and 4.1 are the first for this regime to 
establish polynomial rates of convergence. 

In [11], these Rademacher complexity results are used to derive conver- 
gence rates for learning pairwise rankings. This setting is related to ours, 
but differs in some important ways. In [11], a rule
$r : \mathcal{X} \times \mathcal{X} \to \{-1, +1\}$ is desired such that, given
$X, X' \in \mathcal{X}$, $r$ indicates which has the higher rank. In this
setting, $X$ and $X'$ can be thought of as covariates describing the two
objects for which a relative ranking is desired, and $\mathcal{X}$ represents
the space of allowable covariate values. In our network setting, the
nonparametric model $\omega : [0,1]^2 \to [0,1]$ is analogous to a ranking
rule, with $\mathcal{X}$ taken to be $[0,1]$.
However, X and X' are never observed in the data, and effectively must be 
imputed up to measure-preserving transformation. 

The recent work of [15] analyzes the consistency of co-clustering with
model misspecification, but in a rather different setting, with the data
matrix $A$ assumed to be real valued, along with a real-valued
generalization of the co-blockmodel. This generalization utilizes discrete
latent class variables $S$ and $T$; conditioned on $S(i)$ and $T(j)$, the
distribution of $A_{ij}$ is assumed to have mean $\theta_{S(i)T(j)}$, but may
otherwise be arbitrary up to technical conditions, and may be misspecified
in the estimator. Under these assumptions, it is shown that the latent
classes can be estimated consistently if their number is known. In the
case where $A$ is binary, the conditions of [15] are equivalent to assuming
a generative co-blockmodel with known number of classes.

Finally, the very recent work of [9] derives a simple and elegant
spectral method to consistently estimate the matrix $W$ defined in the
proof of Lemma 5.3 in Section 5.2; i.e., the mapping $\omega(x,y)$, evaluated
at the values of the latent variables $\xi_1,\dots,\xi_m$ and
$\zeta_1,\dots,\zeta_n$. This implies consistency of estimation of $\omega$ in
the sense; and while rates of convergence are not given for general
$\omega$, they can be established for particular instances, such as under the
assumption of a generative blockmodel whose number of classes $K$ is
growing with $n$. Our setting is distinct, in that we desire only the best
blockmodel approximation to $\omega$, and so are able to establish rates of
convergence that are independent of $\omega$.

APPENDIX A: PROOF OF THEOREM 3.1 

To prove Theorem 3.1, we first relate the objective functions of (3.1)
and (3.2), and the corresponding $R_\omega(\phi)$ and $L_\omega(\phi)$, to the
support functions $h_{\mathcal{F}_A}(\cdot)$ and $h_{\mathcal{F}_\omega}(\cdot)$.
The result then follows directly from Theorem 4.1.

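Lemma A.1. Assume the notation of Theorem 3.1, and let $R_A(\phi)$ be the
objective function of (3.1). Then, given $\phi = (\mu,\nu,\theta) \in \Phi$,

$$
R_A(\phi) - R_\omega(\phi)
 = 2\bigl[h_{\mathcal{F}_\omega}(\theta) - h_{\mathcal{F}_A}(\theta)\bigr]
 + \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2
 - \int_{[0,1]^2} \omega(x,y)^2 \,dx\,dy .
$$

Let $L_A(\phi)$ be the objective function of (3.2). Define
$B_\theta \in \mathbb{R}_+$, $\Gamma_\theta \in [-1,1]^{K\times K}$ to satisfy
$B_\theta\,\Gamma_\theta(a,b) = \log(\theta_{ab}/(1-\theta_{ab}))$ for
$a,b = 1,\dots,K$. Then

$$
L_A(\phi) - L_\omega(\phi)
 = B_\theta\bigl[h_{\mathcal{F}_A}(\Gamma_\theta) - h_{\mathcal{F}_\omega}(\Gamma_\theta)\bigr].
$$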

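To complete the proof of the first part of Theorem 3.1, let
$\hat\phi = (\hat\mu,\hat\nu,\hat\theta) = \arg\min_{\phi\in\Phi} R_A(\phi)$. For any
$\phi = (\mu,\nu,\theta)$ in $\Phi$,

$$
\begin{aligned}
R_\omega(\hat\phi) - R_\omega(\phi)
 &= R_\omega(\hat\phi) - R_A(\hat\phi) + R_A(\hat\phi) - R_A(\phi) + R_A(\phi) - R_\omega(\phi) \\
 &\le R_\omega(\hat\phi) - R_A(\hat\phi) + R_A(\phi) - R_\omega(\phi) \\
 &\le 2\bigl|h_{\mathcal{F}_A}(\hat\theta) - h_{\mathcal{F}_\omega}(\hat\theta)\bigr|
   + 2\bigl|h_{\mathcal{F}_\omega}(\theta) - h_{\mathcal{F}_A}(\theta)\bigr|,
\end{aligned}
$$

where the first inequality holds because $R_A(\hat\phi) - R_A(\phi) \le 0$, and
the second holds by the triangle inequality and Lemma A.1, in which the
terms not involving the support functions cancel. Applying Theorem 4.1 and
choosing $\phi$ to satisfy
$R_\omega(\phi) \le \inf_{\phi'\in\Phi} R_\omega(\phi') + n^{-1/4}$ then yields the
claimed result that
$R_\omega(\hat\phi) - \inf_{\phi\in\Phi} R_\omega(\phi) = O_P(n^{-1/4})$.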

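Now set $\hat\phi = \arg\max_{\phi\in\Phi} L_A(\phi)$ and
$\phi = \arg\max_{\phi\in\Phi} L_\omega(\phi)$; the result
$[L_\omega(\phi) - L_\omega(\hat\phi)]/[B(\theta) + B(\hat\theta)] = O_P(n^{-1/4})$
follows similarly from

$$
\begin{aligned}
0 \le L_\omega(\phi) - L_\omega(\hat\phi)
 &= L_\omega(\phi) - L_A(\phi) + L_A(\phi) - L_A(\hat\phi) + L_A(\hat\phi) - L_\omega(\hat\phi) \\
 &\le L_\omega(\phi) - L_A(\phi) + L_A(\hat\phi) - L_\omega(\hat\phi) \\
 &\le B(\theta)\bigl|h_{\mathcal{F}_\omega}(\Gamma_\theta) - h_{\mathcal{F}_A}(\Gamma_\theta)\bigr|
   + B(\hat\theta)\bigl|h_{\mathcal{F}_A}(\Gamma_{\hat\theta}) - h_{\mathcal{F}_\omega}(\Gamma_{\hat\theta})\bigr|.
\end{aligned}
$$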
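Proof of Lemma A.1. We show the results of the lemma directly:

$$
\begin{aligned}
R_A(\phi)
 &= \min_{(S,T)\in Q^m_\mu\times Q^n_\nu} \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}
    \bigl|\theta_{S(i)T(j)} - A_{ij}\bigr|^2 \\
 &= \min_{(S,T)\in Q^m_\mu\times Q^n_\nu}
    \Bigl\{\sum_{a=1}^{K}\sum_{b=1}^{K}\theta_{ab}^2\mu_a\nu_b - 2\langle\theta, A/ST\rangle\Bigr\}
    + \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2 \\
 &= \Bigl\{-2h_{\mathcal{F}_A}(\theta) + \sum_{a=1}^{K}\sum_{b=1}^{K}\theta_{ab}^2\mu_a\nu_b\Bigr\}
    + \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2,
\end{aligned}
$$

where the second line follows from the definition of $A/ST$, and the last
line from that of $h_{\mathcal{F}_A}$. To complete the first result of the
lemma, observe that by letting $\sigma$ and $\tau$ satisfy
$\sigma(x) = F_\mu^{-1}(\pi_1(x))$ and $\tau(y) = F_\nu^{-1}(\pi_2(y))$,

$$
\begin{aligned}
R_\omega(\phi)
 &= \inf_{\pi_1,\pi_2\in\mathcal{P}} \int_{[0,1]^2}
    \bigl|\omega(\pi_1(x),\pi_2(y)) - \omega_\phi(x,y)\bigr|^2 \,dx\,dy \\
 &= \inf_{(\sigma,\tau)\in Q_\mu\times Q_\nu} \sum_{a=1}^{K}\sum_{b=1}^{K}
    \int_{\sigma^{-1}(a)\times\tau^{-1}(b)} \bigl|\omega(x,y) - \theta_{ab}\bigr|^2 \,dx\,dy \\
 &= \inf_{F\in\mathcal{F}_\omega}
    \Bigl\{-2\langle\theta, F\rangle + \sum_{a=1}^{K}\sum_{b=1}^{K}\theta_{ab}^2\mu_a\nu_b\Bigr\}
    + \int_{[0,1]^2}\omega(x,y)^2 \,dx\,dy \\
 &= \Bigl\{-2h_{\mathcal{F}_\omega}(\theta) + \sum_{a=1}^{K}\sum_{b=1}^{K}\theta_{ab}^2\mu_a\nu_b\Bigr\}
    + \int_{[0,1]^2}\omega(x,y)^2 \,dx\,dy .
\end{aligned}
$$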
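
The algebraic step behind the first result can be verified numerically for a fixed pair of assignments. The sketch below assumes that $A/ST$ denotes the $K \times K$ matrix with entries $(A/ST)_{ab} = (mn)^{-1}\sum_{i,j} A_{ij}\,1\{S(i)=a\}\,1\{T(j)=b\}$ and that $\mu$, $\nu$ are the class proportions of $S$ and $T$; these definitions are inferred from the derivation rather than quoted from the article, and all variable names are illustrative.

```python
import random

random.seed(0)
m, n, K = 12, 15, 3
A = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
S = [random.randrange(K) for _ in range(m)]   # row class assignments
T = [random.randrange(K) for _ in range(n)]   # column class assignments
theta = [[random.uniform(0.1, 0.9) for _ in range(K)] for _ in range(K)]

mu = [S.count(a) / m for a in range(K)]       # row class proportions
nu = [T.count(b) / n for b in range(K)]       # column class proportions

# Assumed definition: (A/ST)_{ab} = (1/mn) sum_{i,j} A_ij 1{S(i)=a} 1{T(j)=b}.
A_ST = [[0.0] * K for _ in range(K)]
for i in range(m):
    for j in range(n):
        A_ST[S[i]][T[j]] += A[i][j] / (m * n)

# Left side: mean squared error of the blockmodel fit for this (S, T).
lhs = sum((theta[S[i]][T[j]] - A[i][j]) ** 2
          for i in range(m) for j in range(n)) / (m * n)

# Right side: sum theta^2 mu nu - 2 <theta, A/ST> + (1/mn) sum A_ij^2.
quad = sum(theta[a][b] ** 2 * mu[a] * nu[b] for a in range(K) for b in range(K))
inner = sum(theta[a][b] * A_ST[a][b] for a in range(K) for b in range(K))
energy = sum(A[i][j] ** 2 for i in range(m) for j in range(n)) / (m * n)
rhs = quad - 2 * inner + energy

print(abs(lhs - rhs) < 1e-12)
```

Because the identity holds for every fixed $(S,T)$ with the given class proportions, minimizing over $(S,T)$ only affects the inner-product term, which is what produces the support function.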

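Following similar steps, we show the second result as follows:

$$
\begin{aligned}
L_A(\phi)
 &= \max_{(S,T)\in Q^m_\mu\times Q^n_\nu} \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}
    \bigl\{A_{ij}\log(\theta_{S(i)T(j)}) + (1 - A_{ij})\log(1 - \theta_{S(i)T(j)})\bigr\} \\
 &= B_\theta\, h_{\mathcal{F}_A}(\Gamma_\theta)
    + \sum_{a=1}^{K}\sum_{b=1}^{K}\mu_a\nu_b\log(1-\theta_{ab}),
\end{aligned}
$$

since $\max_{F\in\mathcal{F}_A}\sum_{a,b} F_{ab}\,B_\theta\,\Gamma_\theta(a,b)
= B_\theta\, h_{\mathcal{F}_A}(\Gamma_\theta)$, and similarly

$$
\begin{aligned}
L_\omega(\phi)
 &= \sup_{\pi_1,\pi_2\in\mathcal{P}} \int_{[0,1]^2}
    \bigl\{\omega(\pi_1(x),\pi_2(y))\log\omega_\phi(x,y)
    + [1 - \omega(\pi_1(x),\pi_2(y))]\log(1 - \omega_\phi(x,y))\bigr\}\,dx\,dy \\
 &= \sup_{(\sigma,\tau)\in Q_\mu\times Q_\nu} \sum_{a=1}^{K}\sum_{b=1}^{K}
    \int_{\sigma^{-1}(a)\times\tau^{-1}(b)}
    \bigl\{\omega(x,y)\log\theta_{ab} + (1 - \omega(x,y))\log(1-\theta_{ab})\bigr\}\,dx\,dy \\
 &= \sup_{F\in\mathcal{F}_\omega} \sum_{a=1}^{K}\sum_{b=1}^{K}
    \Bigl\{F_{ab}\log\Bigl(\frac{\theta_{ab}}{1-\theta_{ab}}\Bigr) + \mu_a\nu_b\log(1-\theta_{ab})\Bigr\} \\
 &= B_\theta\, h_{\mathcal{F}_\omega}(\Gamma_\theta)
    + \sum_{a=1}^{K}\sum_{b=1}^{K}\mu_a\nu_b\log(1-\theta_{ab}). \qquad □
\end{aligned}
$$

APPENDIX B: AUXILIARY PROOFS FOR THEOREM 4.1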



Below we provide proofs of all supporting lemmas for Theorem 4.1, and 
state and prove the covering argument used to establish the theorem. 

1. First, in Sections B.1-B.3 below, we prove auxiliary Lemmas 5.2, 5.4, 
and 5.5 as stated in Section 5. 

2. Then, in Section B.4, we generalize the definitions of $Q_U$ and $Q_V$, given
in Section 5.2 for $K = 2$, to arbitrary $K$; this induces generalizations
of the quantities $\delta_U$, $\delta_V$, and $\delta_{UV}$ in the natural way.

3. Then, in Sections B.5 and B.6, we prove Lemmas 5.6 and 5.7, which
depend on $(Q_U, Q_V, \delta_U, \delta_V, \delta_{UV})$ as defined for arbitrary $K$.

4. Finally, in Section B.7, we extend the pointwise convergence result of
Proposition 5.1 by way of a covering argument for all $\Gamma \in [-1,1]^{K\times K}$.

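B.1. Proof of Lemma 5.2. For fixed $\Gamma$, let
$(\sigma^*,\tau^*) \in Q_\mu\times Q_\nu$ satisfy

$$
\langle\Gamma, \omega/\sigma^*\tau^*\rangle \ge h_{\mathcal{F}_\omega}(\Gamma) - n^{-1/4},
\tag{B.1}
$$

so that $\omega/\sigma^*\tau^*$ is within $n^{-1/4}$ of the supporting hyperplane.
Define

$$
S^*(i) = \sigma^*(\xi_i), \quad T^*(j) = \tau^*(\zeta_j); \qquad i = 1,\dots,m,\ j = 1,\dots,n.
$$

By the arguments of Lemma 5.4 as proved in Section B.2 below, applying
McDiarmid's inequality with the Lipschitz conditions of Lemma 5.1 yields

$$
P\bigl(|\langle\Gamma, A/S^*T^*\rangle - \langle\Gamma, \omega/\sigma^*\tau^*\rangle| \ge 2\epsilon\bigr)
 \le 2e^{-2mn\epsilon^2/(m+n)} + 2e^{-2mn\epsilon^2}.
\tag{B.2}
$$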
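While $(S^*,T^*)$ may not be in $Q^m_\mu\times Q^n_\nu$, a Chernoff bound
implies that

$$
P\Bigl(\bigl|\,|S^{*-1}(a)|/m - \mu_a\bigr| \ge \epsilon\Bigr) \le 2e^{-2m\epsilon^2},
\qquad a = 1,\dots,K.
$$

The analogous bound also holds for $\bigl|\,|T^{*-1}(b)|/n - \nu_b\bigr|$.
Applying these results in conjunction with a union bound yields

$$
P\Bigl(\max_{1\le a,b\le K}\max\Bigl\{\bigl|\,|S^{*-1}(a)|/m - \mu_a\bigr|,\
 \bigl|\,|T^{*-1}(b)|/n - \nu_b\bigr|\Bigr\} \ge \epsilon\Bigr)
 \le K\bigl(2e^{-2m\epsilon^2} + 2e^{-2n\epsilon^2}\bigr).
$$

Therefore, with probability at least
$1 - K(2e^{-2m\epsilon^2} + 2e^{-2n\epsilon^2})$, there exists a pair
$(\hat S,\hat T) \in Q^m_\mu\times Q^n_\nu$ such that

$$
\frac{1}{m}\,d_{\mathrm{Ham}}(S^*,\hat S) + \frac{1}{n}\,d_{\mathrm{Ham}}(T^*,\hat T) \le 2K\epsilon,
$$

which by the first condition of Lemma 5.1 implies that

$$
\bigl|\langle\Gamma, A/\hat S\hat T\rangle - \langle\Gamma, A/S^*T^*\rangle\bigr| \le 4K\epsilon.
\tag{B.3}
$$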

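Recalling that
$h_{\mathcal{F}_A}(\Gamma) = \max_{(S,T)\in Q^m_\mu\times Q^n_\nu}\langle\Gamma, A/ST\rangle$,
we have that

$$
h_{\mathcal{F}_A}(\Gamma) \ge \langle\Gamma, A/\hat S\hat T\rangle,
$$

following which (B.3), (B.2), and (B.1) in turn imply that with probability
at least
$1 - 2e^{-2mn\epsilon^2/(m+n)} - 2e^{-2mn\epsilon^2} - K(2e^{-2m\epsilon^2} + 2e^{-2n\epsilon^2})$,
we have

$$
\begin{aligned}
h_{\mathcal{F}_A}(\Gamma)
 &\ge \langle\Gamma, A/S^*T^*\rangle - 4K\epsilon \\
 &\ge \langle\Gamma, \omega/\sigma^*\tau^*\rangle - (4K+2)\epsilon \\
 &\ge h_{\mathcal{F}_\omega}(\Gamma) - n^{-1/4} - (4K+2)\epsilon .
\end{aligned}
$$

Now letting $m = \rho n$ as in the statement of the lemma, and setting
$\epsilon = n^{-1/4}$, we see that with probability at least
$1 - 2e^{-\sqrt{n}\,[2\rho/(\rho+1)]}\,[1 + o(1)]$,

$$
h_{\mathcal{F}_A}(\Gamma) \ge h_{\mathcal{F}_\omega}(\Gamma) - \frac{4K+3}{n^{1/4}},
$$

providing a lower bound on $h_{\mathcal{F}_A}(\Gamma)$ that converges to
$h_{\mathcal{F}_\omega}(\Gamma)$.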
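B.2. Proof of Lemma 5.4. Recalling the definitions of $h_{\mathcal{F}_A}$ and
$h_{\mathcal{F}_W}$,

$$
\begin{aligned}
P\bigl(|h_{\mathcal{F}_A}(\Gamma) - h_{\mathcal{F}_W}(\Gamma)| \ge \epsilon\bigr)
 &= P\Bigl(\Bigl|\max_{(S,T)\in Q^m_\mu\times Q^n_\nu}\langle\Gamma, A/ST\rangle
    - \max_{(S,T)\in Q^m_\mu\times Q^n_\nu}\langle\Gamma, W/ST\rangle\Bigr| \ge \epsilon\Bigr) \\
 &\le P\Bigl(\max_{(S,T)\in Q^m_\mu\times Q^n_\nu}
    \bigl|\langle\Gamma, A/ST\rangle - \langle\Gamma, W/ST\rangle\bigr| \ge \epsilon\Bigr) \\
 &\le \sum_{(S,T)\in Q^m_\mu\times Q^n_\nu}
    P\bigl(|\langle\Gamma, A/ST\rangle - \langle\Gamma, W/ST\rangle| \ge \epsilon\bigr),
\end{aligned}
\tag{B.4}
$$

where (B.4) follows from a union bound.

Now consider $\langle\Gamma, A/ST\rangle$ as a function of the $mn$ independent
random variables $\{A_{ij}\}$, and observe that
$E(\langle\Gamma, A/ST\rangle) = \langle\Gamma, W/ST\rangle$ for each $(S,T)$,
since $W_{ij} = \omega(\xi_i,\zeta_j) = E(A_{ij})$. Also recall the final Lipschitz
condition of Lemma 5.1, which states that
$|\langle\Gamma, A/ST\rangle - \langle\Gamma, A'/ST\rangle| \le 1/(mn)$ if $A$ and
$A'$ differ by a single entry. Thus we may apply McDiarmid's inequality to
$\langle\Gamma, A/ST\rangle$ in the summand of (B.4), and since
$|Q^m_\mu| \le K^m$ and $|Q^n_\nu| \le K^n$,

$$
P\bigl(|h_{\mathcal{F}_A}(\Gamma) - h_{\mathcal{F}_W}(\Gamma)| \ge \epsilon\bigr)
 \le K^{m+n}\cdot 2e^{-2mn\epsilon^2}.
$$

Now consider
$h_{\mathcal{F}_W}(\Gamma) = \max_{(S,T)\in Q^m_\mu\times Q^n_\nu}\langle\Gamma, W/ST\rangle$
as a function of the $m+n$ independent random variables
$\xi_1,\dots,\xi_m$ and $\zeta_1,\dots,\zeta_n$. Changing a single component of
$\xi$ or $\zeta$ affects a single row or column of $W$, respectively, and thus
alters $\langle\Gamma, W/ST\rangle$ and hence $h_{\mathcal{F}_W}$ by at most $1/m$
or $1/n$. It therefore follows from McDiarmid's inequality that

$$
P\bigl(|h_{\mathcal{F}_W}(\Gamma) - E\,h_{\mathcal{F}_W}(\Gamma)| \ge \epsilon\bigr)
 \le 2e^{-2mn\epsilon^2/(m+n)}.
$$

Combining these inequalities via a union bound yields the statement of the
lemma, since by the triangle inequality we must have
$|h_{\mathcal{F}_A}(\Gamma) - h_{\mathcal{F}_W}(\Gamma)| \ge \epsilon$ or
$|h_{\mathcal{F}_W}(\Gamma) - E\,h_{\mathcal{F}_W}(\Gamma)| \ge \epsilon$ in order that
$|h_{\mathcal{F}_A}(\Gamma) - E\,h_{\mathcal{F}_W}(\Gamma)| \ge 2\epsilon$.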
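
The two ingredients of this proof, the bounded-difference property and the resulting tail bound, can be illustrated with a short sketch. It assumes $\langle\Gamma, A/ST\rangle = (mn)^{-1}\sum_{i,j} A_{ij}\,\Gamma_{S(i)T(j)}$ with $|\Gamma_{ab}| \le 1$, which is an inferred reading of the notation; the function names `inner_product` and `lemma54_bound` are illustrative only.

```python
import math
import random

def inner_product(Gamma, A, S, T):
    """<Gamma, A/ST> = (1/mn) sum_{i,j} A_ij Gamma[S(i)][T(j)] (assumed definition)."""
    m, n = len(A), len(A[0])
    total = sum(A[i][j] * Gamma[S[i]][T[j]] for i in range(m) for j in range(n))
    return total / (m * n)

def lemma54_bound(m, n, K, eps):
    """2 exp(-2mn eps^2/(m+n)) + 2 K^(m+n) exp(-2mn eps^2), capped at 1."""
    # Work with the logarithm of the union-bound term to avoid overflow.
    log_union = math.log(2) + (m + n) * math.log(K) - 2 * m * n * eps ** 2
    if log_union >= 0:
        return 1.0                     # the bound is vacuous in this regime
    mcdiarmid = 2 * math.exp(-2 * m * n * eps ** 2 / (m + n))
    return min(1.0, mcdiarmid + math.exp(log_union))

random.seed(1)
m, n, K = 8, 10, 3
Gamma = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
A = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
S = [random.randrange(K) for _ in range(m)]
T = [random.randrange(K) for _ in range(n)]

base = inner_product(Gamma, A, S, T)
max_change = 0.0
for i in range(m):
    for j in range(n):
        A[i][j] = 1 - A[i][j]          # flip a single entry of A
        max_change = max(max_change, abs(inner_product(Gamma, A, S, T) - base))
        A[i][j] = 1 - A[i][j]          # restore the entry

print(max_change <= 1 / (m * n) + 1e-12)   # single-entry Lipschitz condition
print(lemma54_bound(1000, 1000, 2, 0.1))   # tail bound for a moderately large array
```

The union-bound term $K^{m+n}e^{-2mn\epsilon^2}$ is only informative when $mn\epsilon^2$ dominates $(m+n)\log K$, which is why the sketch reports a vacuous bound for small arrays.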



B.3. Proof of Lemma 5.5. Recall from the statement of the lemma
that $I$ and $J$ denote sets of deterministic size whose elements are sampled
without replacement from $1,\dots,m$ and $1,\dots,n$, respectively. We adopt the
notation that $E_I$ denotes an expectation taken over $I$, with all other random
variables held constant, and define $E_J$ and $E_{IJ}$ in the same manner.
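To prove the lemma, it suffices to show that for all $W$, $T$, $S$,

$$
E_J\bigl(\langle\Gamma, W/\hat S^T T\rangle\bigr)
 \ge \langle\Gamma, W/S^T T\rangle - K\sqrt{2\pi/|J|},
\tag{B.5}
$$

$$
E_I\bigl(\langle\Gamma, W/S\hat T^S\rangle\bigr)
 \ge \langle\Gamma, W/S T^S\rangle - K\sqrt{2\pi/|I|},
\tag{B.6}
$$

where $\hat S^T$ and $\hat T^S$ are respectively defined in (5.2) and (5.3), and

$$
S^T = \arg\max_{S}\,\langle\Gamma, W/ST\rangle, \qquad
T^S = \arg\max_{T}\,\langle\Gamma, W/ST\rangle .
$$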

This is because (B.5) and (B.6) imply that for all $(U,V) \in Q^m_\mu\times Q^n_\nu$,

(r, wjuv) < (r, w/UT^) 

< Ex ((F, w/s^''f^)) + If ywm 

< Ex Ej ({T, W/S^''f^)) + Ky'2Tr/\I\ + i^VWI^I 



<Exj( max (T,W/S^f^))+KV2^(\l\~^ + \J\~^). 

Recalling the definition of $h_{\mathcal{F}_W}(\Gamma)$, and noting that the right-hand side above
is deterministic for fixed $W$, with no dependence on $U$ or $V$, we may write

h^w(r)= max {r,w/uv) 
{c/,y)es™xs^ 

<Exj( max {r,W/ S^f^)) + kV2^ + \J\~^2) . 

Taking expectations on both sides over W gives the statement of the lemma. 

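We now establish (B.5), noting that (B.6) will follow by parallel
arguments. For fixed $W$ and $T$, define for any $a = 1,\dots,K$ the difference

$$
\Delta_i^a = \frac{1}{|J|}\sum_{j\in J} W_{ij}\Gamma_{aT(j)}
 - \frac{1}{n}\sum_{j=1}^{n} W_{ij}\Gamma_{aT(j)} .
$$

It follows that $E_J(\Delta_i^a) = 0$, and by a Chernoff bound,

$$
P\bigl(|\Delta_i^a| \ge t\bigr) \le 2e^{-2t^2|J|}.
$$

As $|\Delta_i^a|$ is nonnegative, the identity
$E(X) = \int_0^\infty P(X \ge t)\,dt$ for $X$ taking only nonnegative values can
be used to bound its expectation according to

$$
E_J\bigl(|\Delta_i^a|\bigr) \le \sqrt{\pi/(2|J|)},
$$

which implies

$$
E_J\Bigl(\max_{1\le a\le K}|\Delta_i^a|\Bigr) \le K\sqrt{\pi/(2|J|)}.
\tag{B.7}
$$

For fixed $W$ and $J$, define the function

$$
f_W(S,T) = \frac{1}{m|J|}\sum_{i=1}^{m}\sum_{j\in J} W_{ij}\Gamma_{S(i)T(j)},
$$

and for fixed $W$ and $T$, let

$$
\Delta = \max_{S\in Q^m_\mu}\bigl|f_W(S,T) - \langle\Gamma, W/ST\rangle\bigr|.
\tag{B.8}
$$

From the definition of $\Delta$ it follows that

$$
\begin{aligned}
\Delta &= \max_{S\in Q^m_\mu}\Bigl|\frac{1}{m}\sum_{i=1}^{m}
   \Bigl\{\frac{1}{|J|}\sum_{j\in J} W_{ij}\Gamma_{S(i)T(j)}
   - \frac{1}{n}\sum_{j=1}^{n} W_{ij}\Gamma_{S(i)T(j)}\Bigr\}\Bigr| \\
 &\le \frac{1}{m}\sum_{i=1}^{m}\max_{1\le a\le K}
   \Bigl|\frac{1}{|J|}\sum_{j\in J} W_{ij}\Gamma_{aT(j)}
   - \frac{1}{n}\sum_{j=1}^{n} W_{ij}\Gamma_{aT(j)}\Bigr| \\
 &= \frac{1}{m}\sum_{i=1}^{m}\max_{1\le a\le K}|\Delta_i^a| .
\end{aligned}
$$

Taking expectations of both sides over $J$ and substituting (B.7) yields

$$
E_J(\Delta) \le \frac{1}{m}\sum_{i=1}^{m} E_J\Bigl(\max_{1\le a\le K}|\Delta_i^a|\Bigr)
 \le K\sqrt{\pi/(2|J|)}.
\tag{B.9}
$$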

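Finally, to show (B.5), observe that since $\hat S^T$ from (5.2) maximizes
$f_W(\cdot,T)$, and $S^T$ as defined above maximizes
$\langle\Gamma, W/\cdot\,T\rangle$, we have from (B.8) that

$$
\begin{aligned}
0 &\le \langle\Gamma, W/S^T T\rangle - \langle\Gamma, W/\hat S^T T\rangle \\
  &\le \langle\Gamma, W/S^T T\rangle - f_W(S^T,T) + f_W(\hat S^T,T)
     - \langle\Gamma, W/\hat S^T T\rangle \le 2\Delta,
\end{aligned}
$$

and so $\langle\Gamma, W/\hat S^T T\rangle \ge \langle\Gamma, W/S^T T\rangle - 2\Delta$.
Taking expectations of both sides of this expression over $J$, and then
substituting (B.9), yields the inequality

$$
E_J\bigl(\langle\Gamma, W/\hat S^T T\rangle\bigr)
 \ge \langle\Gamma, W/S^T T\rangle - 2K\sqrt{\pi/(2|J|)},
$$

which is the statement of (B.5), since $2K\sqrt{\pi/(2|J|)} = K\sqrt{2\pi/|J|}$.
That of (B.6) follows by parallel arguments.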
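
The tail-integration step in this proof, which turns the bound $P(|\Delta_i^a| \ge t) \le 2e^{-2t^2|J|}$ into $E_J(|\Delta_i^a|) \le \sqrt{\pi/(2|J|)}$, can be confirmed numerically. The sketch below uses a plain midpoint rule; the function name `tail_integral` and the truncation point of the integral are choices made here for illustration.

```python
import math

def tail_integral(size_j, steps=200_000):
    """Midpoint-rule approximation of the integral of 2 exp(-2 t^2 |J|) over t >= 0."""
    upper = 10.0 / math.sqrt(size_j)   # the integrand is negligible beyond this point
    h = upper / steps
    return h * sum(2 * math.exp(-2 * ((k + 0.5) * h) ** 2 * size_j)
                   for k in range(steps))

for size_j in (10, 100, 1000):
    closed_form = math.sqrt(math.pi / (2 * size_j))
    print(size_j, tail_integral(size_j), closed_form)

# The constant in (B.5): 2K sqrt(pi / (2|J|)) equals K sqrt(2 pi / |J|).
K, size_j = 4, 100
print(math.isclose(2 * K * math.sqrt(math.pi / (2 * size_j)),
                   K * math.sqrt(2 * math.pi / size_j)))
```

The numerical and closed-form columns should agree to many digits, which also shows that the $|J|^{-1/2}$ decay in (B.5) comes entirely from this Gaussian-type integral.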
B.4. Definition of $Q_U$ and $Q_V$ for arbitrary $K$. In order to redefine
$Q_U$ and $Q_V$ to accommodate arbitrary $K$, we first redefine the mappings
$U$ and $V$. Given $\zeta_J = \{\zeta_j : j \in J\}$ and an assignment function
$R : \{1,\dots,n\} \to \{1,\dots,K\}$, define the mapping
$U : [0,1] \to \mathbb{R}^K$ by

$$
U_a(x) = \sum_{j\in J}\omega(x,\zeta_j)\,\Gamma_{aR(j)}, \qquad x\in[0,1],\ a = 1,\dots,K.
$$

Analogously, given $\xi_I = \{\xi_i : i \in I\}$ and $Q$, define
$V : [0,1] \to \mathbb{R}^K$ by

$$
V_a(y) = \sum_{i\in I}\omega(\xi_i,y)\,\Gamma_{Q(i)a}, \qquad y\in[0,1],\ a = 1,\dots,K.
$$

Given $a,b \in \{1,\dots,K\}$ and the mapping $U$, define the relation
$\succeq^{U,a,b}$ by

$$
x_1 \succeq^{U,a,b} x_2 \quad\text{if}\quad
\begin{cases}
U_a(x_1) + U_b(x_2) > U_a(x_2) + U_b(x_1), & \text{or} \\
U_a(x_1) + U_b(x_2) = U_a(x_2) + U_b(x_1), & \text{if } (a-b)(x_1-x_2) \ge 0.
\end{cases}
$$

Informally, $x_1 \succeq^{U,a,b} x_2$ implies that, given the choice of assigning
either $x_1$ or $x_2$ to group $a$, with the other relegated to group $b$, $x_1$
is at least as attractive as $x_2$. The latter tie-breaker condition results
in a symmetric definition: if $x_1 \succeq^{U,a,b} x_2$, then
$x_2 \succeq^{U,b,a} x_1$. We define $\succ^{U,a,b}$ analogously to
$\succeq^{U,a,b}$, except that the inequality $(a-b)(x_1-x_2) > 0$ is strict.

Let $\mathcal{S}$ denote the set of symmetric matrices in $[0,1]^{K\times K}$.
Given $t \in \mathcal{S}$ and the mapping $U$, we define the function
$\sigma_t : [0,1] \to \{1,\dots,K\}$ as the mapping which satisfies the following

$$
\sigma_t^{-1}(a) = \bigl\{x : x \succeq^{U,a,b} t_{ab}\ \forall\, b > a,\
 x \succ^{U,a,b} t_{ab}\ \forall\, b < a\bigr\}, \qquad a = 1,\dots,K,
$$

with the convention that $\sigma_t$ is undefined whenever the above rule does
not map all of $[0,1]$ to $\{1,\dots,K\}$.

We define the function class $Q_U$ as follows:

$$
Q_U = \{\sigma_t : t \in \mathcal{S} \text{ and } \sigma_t \text{ is defined}\}.
$$

Given $t \in \mathcal{S}$ and the mapping $V$ as defined above, we define
$\succeq^{V,a,b}$, $\tau_t$, and $Q_V$ analogously. We then have the following.

Lemma B.1. Given $U$ induced by $\zeta_J$ and $R$, and given $W$ induced by
$\xi$ and $\zeta$, define $\hat S^R$ by (5.2). Then there exists
$\hat\sigma \in Q_U$ such that

$$
\hat S^R(i) = \hat\sigma(\xi_i), \qquad i = 1,\dots,m.
$$

Likewise, given $V$ induced by $\xi_I$ and $Q$, and given $W$ induced by $\xi$
and $\zeta$, define $\hat T^Q$ by (5.3). Then there exists $\hat\tau \in Q_V$
such that

$$
\hat T^Q(j) = \hat\tau(\zeta_j), \qquad j = 1,\dots,n.
$$
Proof of Lemma B.1. Let $\hat S^R$ be chosen lexicographically from the set
of all maximizers of (5.2), where $S$ lexicographically precedes $S'$ if and
only if $S(i_1),\dots,S(i_m)$ lexicographically precedes
$S'(i_1),\dots,S'(i_m)$, where $i_1,\dots,i_m$ are in order of increasing
$\xi_{i_1},\dots,\xi_{i_m}$.

Since $\hat S^R$ maximizes (5.2), it holds for all $i,j = 1,\dots,m$ that

$$
U_{\hat S^R(i)}(\xi_i) + U_{\hat S^R(j)}(\xi_j) \ge U_{\hat S^R(j)}(\xi_i) + U_{\hat S^R(i)}(\xi_j);
$$

otherwise switching labels for $i$ and $j$ would increase the value of the
objective function. As $\hat S^R$ is chosen lexicographically, for any $i,j$
such that

$$
U_{\hat S^R(i)}(\xi_i) + U_{\hat S^R(j)}(\xi_j) = U_{\hat S^R(j)}(\xi_i) + U_{\hat S^R(i)}(\xi_j),
$$

it holds that $(\hat S^R(i) - \hat S^R(j))(\xi_i - \xi_j) \ge 0$, with equality if
and only if $\xi_i = \xi_j$. Otherwise, switching labels would improve the
lexicographic ordering. Since $\xi_i \ne \xi_j$ for $i \ne j$ except on a set of
measure zero, it follows that

$$
(\hat S^R)^{-1}(a) \succeq^{U,a,b} (\hat S^R)^{-1}(b), \qquad a,b = 1,\dots,K,\ a \ne b,
$$

where we have let $(\hat S^R)^{-1}(a)$ denote $\{\xi_i : \hat S^R(i) = a\}$. As a
result, for each $a$ and $b$ we may choose $t_{ab} = t_{ba} \in [0,1]$ such that
$(\hat S^R)^{-1}(a) \succeq^{U,a,b} t_{ab}$ and
$(\hat S^R)^{-1}(b) \succeq^{U,b,a} t_{ba}$, implying that
$\hat S^R(i) = \hat\sigma(\xi_i)$ for some $\hat\sigma \in Q_U$. As parallel
arguments hold for $\hat T^Q$, the statement of the lemma follows. □

B.5. Proof of Lemma 5.6. Recall the definition of $\delta_{UV}$ from
Section 5.2, which we can now interpret for arbitrary $K$ according to the
definitions of $Q_U$ and $Q_V$ in Section B.4 above. We use a symmetrization
argument of Hoeffding [11, 17] to bound
$E\bigl(\max_{(Q,R)\in Q^m_\mu\times Q^n_\nu}\delta_{UV}\bigr)$. Let $\mathcal{M}_I$
denote the set of permutations of $1,\dots,m$ which map $1,\dots,m-|I|$ to
$i \notin I$, and let $\mathcal{M}_J$ be defined analogously for permutations on
$1,\dots,n$. Let $\mathcal{M} = \mathcal{M}_I\times\mathcal{M}_J$ and let
$Z = |\mathcal{M}|$. Let $(\xi',\zeta')$ be identically distributed as $\xi$ and
$\zeta$, and independent of $U$ and $V$. Let $\xi_I$ and $\zeta_J$ be defined as
in Section B.4. To abbreviate the notation, let
$g_{\sigma\tau}(x,y) = \omega(x,y)\,\Gamma_{\sigma(x)\tau(y)}$, and let
$\mathcal{Q} = Q^m_\mu\times Q^n_\nu\times Q_U\times Q_V$. It holds for
$(Q,R) \in Q^m_\mu\times Q^n_\nu$ that

E (raaxSuv) = E ( sup |G,,.(C,C) - E [G^Ae ,C')\U,V) \\^xXj] , 
which by convexity can be upper-bounded by

e[ sup \G^,r{^,C)-GM^'X')\\^x,Cj 
\iQ,R,cT,T)eQ . 



E I sup 

(Q,R,(T,T)eS 



E 1 sup 

{Q,R,(7,T)eQ 



Zmn 



1 

X] 7 X] 9aT{i-K{i),Cr^{j)) - 9aT{di)^C'n{j)) 



TT,riGAi i=l 



since the permutations $\pi$ and $\eta$ weight each $(i,j)$ term equally for
$i \notin I$ and $j \notin J$; by convexity again, and then linearity of
expectation, we have



<E(m E 



sup 



1 



1 ^ 



^xXj]. 



We may now introduce independent and identically distributed Rademacher
variables $r_1,\dots,r_\ell$, and use standard Rademacher symmetrization
arguments (see, e.g., [8]) to show that the final quantity above is equal to



S^E sup 



1 



i=l 



I (.X, Cj 



<m^E sup 



mn 



1 ^ 1 ^ 



i=l 



i=l 



Cx, Cj 



<2SmE sup 

\{Q,R,a,r)eQ 



1 ^ 

"7 ^ ^ ^iQcrri^i^ Ci. 



i=l 



\^x,Cj ■ 



To bound this expectation, note that for fixed $I$, $J$, $Q$, $R$ (inducing a
fixed $U$ and $V$), and fixed $(\sigma,\tau) \in Q_U\times Q_V$, a Hoeffding
inequality gives



(B.IO) 



1 ^ 

7X]'^^S'<Tr(^i,0) 



> e I a, < 2e 



We may now apply (B.10) in conjunction with a union bound over all
$(Q,R,\sigma,\tau) \in \mathcal{Q}$ as follows. For fixed $Q$, $R$, $a$, $b$, the
set $\{i : \xi_i \succeq^{U,a,b} t_{ab}\}$ can be chosen at most $\ell+1$ ways by
varying $t_{ab}$. As a result, the set $\xi_1,\dots,\xi_\ell$ can be partitioned
at most $(\ell+1)^{\binom{K}{2}}$ ways by varying $\sigma \in Q_U$. Analogously,
the set $\zeta_1,\dots,\zeta_\ell$ can be partitioned the same number of ways by
varying $\tau \in Q_V$. For fixed $I$, $J$, the functions $U$ and $V$ can be
chosen $K^{|I|+|J|}$ different ways by varying $Q$ and $R$. Hence, a union bound
gives



sup 

AQ,R,a,T)£Q 



i=i J 



2lt 



Since this expression is of the form $P(X \ge t) \le f(t)$ for $X$ nonnegative,
we may apply the inequality $E(X) \le \int_0^\infty \min\{1, f(t)\}\,dt$ to yield



mn 



sup 



1 ^ 



1=1 



< 41 



' (|X| + |J|) log K + 2(^) log(£ + 1) + log 2 
21 



Since the bound holds for any $\xi_I$, $\zeta_J$, the same bound holds when the
conditioning is removed and $\xi_I$, $\zeta_J$ are chosen randomly, thus proving
the lemma.

B.6. Proof of Lemma 5.7. To abbreviate notation, let
$\mathcal{Q} = Q^n_\nu\times Q_U$. Let $r_1,\dots,r_m$ be Rademacher variables as
in the proof of Lemma 5.6. By a standard Rademacher symmetrization,



E sup < max 



< 2E sup < max 

l(R,.)6Q U<a<K 



As in the proof of Lemma 5.6, a Hoeffding inequality and union bound yield 

\^-\a)-^Y^l{a{i.) = a}^>e\Qj^ 

<i^l^l(m + l)(2)i^-2e-2™^', 



sup < max 
and applying $E(|X|) \le \int_0^\infty \min\{1, f(t)\}\,dt$ for
$P(|X| \ge t) \le f(t)$ then gives



2E sup < max 
\{R,a)eQ U<a<K 



{\J\ + 1) log K + (^) log(m + 1) + log 2 



< 41 



2m 



As in the proof of Lemma 5.6, removing the conditioning on $\zeta_J$ does not
alter the bound. Parallel arguments apply to $\tau \in Q_V$, and the lemma
follows.

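B.7. Covering argument to establish Theorem 4.1. The establishment of
Theorem 4.1 from Proposition 5.1 proceeds as follows. For
$\mathcal{F} \subset [0,1]^{K\times K}$, recall that
$h_{\mathcal{F}}(\Gamma) = \sup_{F\in\mathcal{F}}\langle\Gamma, F\rangle
= \sup_{F\in\mathcal{F}}\operatorname{tr}(\Gamma^T F)$. By the Cauchy-Schwarz
inequality, $h_{\mathcal{F}}$ is Lipschitz continuous:

$$
|h_{\mathcal{F}}(\Gamma) - h_{\mathcal{F}}(\Gamma')|
 \le \sup_{F\in\mathcal{F}}|\langle\Gamma - \Gamma', F\rangle| \le K\,\|\Gamma - \Gamma'\|.
$$

Let $B_\epsilon$ denote an $\epsilon$-cover in $\|\cdot\|$ for $[-1,1]^{K\times K}$,
with $\Gamma^\epsilon$ the closest point in $B_\epsilon$ to a given $\Gamma$. The
triangle inequality, Lipschitz condition, and $B_\epsilon$ imply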



sup 

re[-i,i]K>iK 



< 



rel 



sup { V.(F^)-/.^.^(F«) 

_l l\KxK L 



< sup 



hr-A (F^) - hr^ (F^ 



+ 2A'||F-F^||| 
+ 2Ke 



max 



+ 2Ke. 



Now let $C_K$ and $n_K$ be defined as in Proposition 5.1, and set
$\epsilon = C_K/n^{1/4}$. It follows by the above relation, a union bound, and
Proposition 5.1 that



max 



sup 



< I 



max 



^-^-(r)-V,".(r)||>36j 



VA(F)-/i^.^(F)| >6 
for all $n > n_K$. The result of Theorem 4.1 then follows, since we have that
\Q.n\ = ("^^7^), and B^jk can be chosen such that \B^jK\ < (1 + -f^^/e)^^ 